For Seminar Presentation-Edited (Feb5)
Table of Contents
1. Abstract
2. Introduction
3. Statement of the problem
4. Objective
4.1. General Objective
5. Data mining
6. Methods
6.1. Machine Learning (ML)
7. Conclusion
8. References
1. Abstract
This report presents an overview of data mining techniques and shows some applications of Artificial Neural Network (ANN) based Data Mining (DM) techniques to School Inspection (SI) data of the Southern Nations, Nationalities, and Peoples' Region (SNNPR) Education Bureau (EB) schools. The bureau conducts SI every year against twenty-six (26) clearly stated standards. The standards address the jobs of different departments, and the inspection results are delivered to the departments accordingly. The major challenge of SI is extracting enough knowledge to support major decisions and to help the departments understand each school's level. Many technologies are available to data mining practitioners, including artificial neural networks, regression, and decision trees. In this seminar I show an application of data mining techniques to monitor SI. The report also provides a brief overview of artificial neural networks and examines their position as an applicable tool in data mining.
Keywords: data mining, machine learning, artificial neural network, knowledge discovery, school inspection, supervised vs. unsupervised learning
2. Introduction
The development of information technology has generated large numbers of databases and huge volumes of data in many areas. Corporations and organizations are now accumulating data at an enormous rate and from a very broad variety of sources, and many relational database servers have been built to store these massive quantities of data. The data itself is critical to a company's growth: it contains knowledge that could lead to important business decisions and take the business to the next level. Yet these data are often examined only in a superficial manner, leaving organizations data rich but knowledge poor. In other words, "We are drowning in data, but starving for knowledge!"
We need information, but what we have is a huge amount of data flooding companies, organizations, and even individuals. Because the amount of data is so enormous that humans cannot process it fast enough to extract information at the right time, machine learning technology has been developed to address this problem. Research in databases and information technology has given rise to approaches for storing and manipulating this precious data for further decision making.
Data mining is the term used to describe the process of extracting value from a database. A data warehouse is a location where data is stored; the type of data stored depends largely on the industry and the company. Data mining, the analysis step of the "Knowledge Discovery in Databases" (KDD) process, is a relatively young and interdisciplinary field of computer science that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Apart from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Data mining is the business of answering questions that you have not yet asked; it reaches deep into databases. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive data mining provides information for understanding what is happening inside the data, without a predetermined idea. Predictive data mining allows the user to submit records with unknown field values, which are predicted from the patterns previously discovered in the database. Data mining models can be categorized according to the tasks they perform: classification and prediction, clustering, and association rules. Classification and prediction are predictive models, while clustering and association rules are descriptive models.
The most common action in data mining is classification. It recognizes patterns that describe the group to which an item belongs, by examining items that have already been classified and inferring a set of rules. Clustering is similar to classification, the major difference being that no groups have been predefined. Prediction is the construction and use of a model to assess the class of an unlabeled object, or the value or value range that a given object is likely to have. The next application is forecasting, which differs from prediction in that it estimates the future value of continuous variables based on patterns within the data.
Four things are required to data-mine effectively: high-quality data, the “right” data, an
adequate sample size and the right tool. There are many tools available to a data mining
practitioner. These include decision trees, various types of regression and neural networks.
3. Statement of the problem
Strengthening an education system means aligning its governance, management, financing, and performance incentive mechanisms to produce quality education for all. The existence of good information about the process of teaching and learning contributes greatly to the quality provision of education, enhancing learning outcomes by improving school performance.
One of the best tools for assuring quality of education is SI. It measures each school's performance and places it in one of four categories: very low level (VLL), low level (LL), high level (HL), and very high level (VHL). The overall school performance result is obtained from three data subcategories. The first is the input category, which measures the resources schools use as inputs and has seven (7) standards. The second is the process category, which measures the activities a school carries out and comprises fourteen (14) standards. The third is the output category, which measures the results schools record and comprises five (5) standards. The sum of these three gives the total performance level of the school.
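As a minimal sketch of how the three category scores might be combined into an overall level, consider the following. The equal weighting and the cut-off thresholds below are hypothetical placeholders for illustration, not the bureau's official values:

```python
# Sketch: combining the three SI data categories into an overall school level.
# The averaging scheme and the level thresholds are hypothetical placeholders,
# not the bureau's official scoring rules.

def overall_level(input_score, process_score, output_score):
    """Combine the three category scores (each 0-100) and map to a level."""
    total = (input_score + process_score + output_score) / 3  # simple average
    if total < 50:
        return "VLL"   # very low level
    elif total < 70:
        return "LL"    # low level
    elif total < 85:
        return "HL"    # high level
    return "VHL"       # very high level

print(overall_level(40, 60, 55))  # total is about 51.7 -> "LL"
```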
SNNPR EB performs SI every year, collects school data for analysis, and delivers the performance results to the departments the standards belong to. The departments then use the information gained from the data for decision making and for planning to improve the schools that fall below the minimum standard. Nowadays, however, the knowledge gained from the raw data alone is insufficient and can mislead the bureau in some decisions. The data must be mined in order to investigate the associations among the schools' results on the different standards.
In this seminar I present data mining work on the 2010 SNNPR SI data using several data mining techniques: preparing the data (data preprocessing), removing unimportant attributes, converting the data into a better form without altering its content, discretizing the data, testing several algorithms and selecting the best one for mining, and filtering with different algorithms while comparing their performance on the SI data. For the demonstration I use the WEKA framework, applying it for attribute selection, clustering, association rules, filtering, and estimation. Finally, I prepare some suggestions for school-level improvement based on the data analysis.
4. Objective
4.1. General Objective
The objective of this seminar is to discover knowledge using artificial neural network based
data mining techniques which are applicable in various scenarios.
5. Data mining
The process of searching and analyzing large amounts of data is called ‘‘data mining’’. The
large collections of data are the potential lodes of valuable information but like in real
mining, the search and extraction can be a difficult and exhaustive process. Data Mining is
a knowledge discovery process of extracting previously unknown, actionable information
from very large databases. In details it is the non-trivial extraction of implicit, previously
unknown and potentially useful information from data. In other words, it is the search from
relationships and global patterns that exist in large databases, but are ‘‘hidden’’ among the
vast amounts of data. These relationships represent valuable knowledge about the database
and objects in the world.
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation
Data mining can be performed on many kinds of data repositories, including:
▪ Relational databases
▪ Data warehouses
▪ Transactional databases
▪ Advanced DB and information repositories
▪ Object-oriented and object-relational databases
▪ Spatial databases
▪ Time-series data and temporal data
▪ Text databases and multimedia databases
▪ Heterogeneous and legacy databases
The KDD process typically involves the following steps:
▪ Data cleaning: The task of this step is to remove noise and inconsistent data.
▪ Data integration: In this step, multiple data sources like the ones mentioned in the
section above can be combined to an integrated collection of data.
▪ Data selection: All the data relevant to the analysis task is retrieved from the
database in this step.
▪ Data transformation: The data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
▪ Data mining: The critical step where intelligent methods are applied in order to
extract data patterns.
▪ Pattern evaluation: This step is deployed to identify the truly interesting patterns
representing knowledge based on certain measures.
▪ Knowledge presentation: In the final step, various visualization and knowledge
representation techniques are used to present the mined knowledge to the user.
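The pre-mining steps above can be illustrated with a toy example. The records below are made up for illustration; the attribute names do not come from the real SI schema:

```python
# Toy illustration of the pre-mining KDD steps (cleaning, selection,
# transformation) on a few hypothetical school records.

records = [
    {"school": "A", "std1": 80, "std2": None},   # missing value -> noisy row
    {"school": "B", "std1": 65, "std2": 70},
    {"school": "B", "std1": 65, "std2": 70},     # duplicate row
    {"school": "C", "std1": 90, "std2": 85},
]

# Data cleaning: drop records with missing values, and drop duplicates.
seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if None not in r.values() and key not in seen:
        seen.add(key)
        cleaned.append(r)

# Data selection: keep only the attributes relevant to the analysis task.
selected = [{"std1": r["std1"], "std2": r["std2"]} for r in cleaned]

# Data transformation: aggregate each record into a single summary value.
transformed = [(r["std1"] + r["std2"]) / 2 for r in selected]
print(transformed)  # -> [67.5, 87.5]
```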
Data mining has five main functions, described in the sections that follow: classification, clustering, regression, association rules, and neural networks.
The model derivation stage focuses on choosing learning samples, testing samples and
learning algorithms. Due to the large volume of available data, data mining may be done
on subsets of the data from the data warehouse. An appropriate data sample is selected
from the data in the warehouse and is checked for descriptiveness. This process may have
to iterate a few times before a suitable sample set can be selected. The selected sample
dataset forms the training data for the data-mining algorithm. The data-mining process is
viewed in our framework as the derivation of an appropriate knowledge model of the
patterns in the data that are interesting to the user. The algorithm for model derivation,
together with the guidance provided by the user, will generally produce several models of
the information contained in the data. The data-mining algorithms use guidance from the
analyst to decide various parameters of the model being learned from the data, such as its
accuracy and prevalence, and to control the computational complexity of the learning
process.
Among all the models generated, the users may only select a few interesting models to be
included in their applications. The usage and maintenance phase is concerned with
monitoring of database updates and continued validation of patterns learned in the past.
Even though the learning process may have user guidance, not all the knowledge models generated will have business applications. Only the interesting models are selected and applied in performing business tasks. Another important task in this stage of the life cycle is
to continuously monitor the validity of the knowledge models in the context of changes to
data in the warehouse. When the population in the warehouse shifts significantly, the
previously learned models will no longer be applicable, and new models will have to be
derived. We may also be able to learn new models incrementally from the new data.
INFOLINK U.COLLEGE, Data Mining Seminar: Artificial Neural Network Based Data Mining Techniques

i. Classification Analysis
This analysis is used to retrieve important and relevant information about data and metadata. It classifies data into different classes. Classification is similar to clustering in that it also segments data records into groups, called classes; unlike clustering, however, the analyst already knows the classes or clusters in advance. In classification analysis you therefore apply algorithms to decide how new data should be classified. Fraud detection and credit-risk applications are particularly well suited to this type of analysis. This approach frequently employs decision-tree or neural-network-based classification algorithms.
The data classification process involves learning and classification. In learning, the training data are analyzed by a classification algorithm. In classification, test data are used to estimate the accuracy of the classification rules; if the accuracy is acceptable, the rules can be applied to new data tuples. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we build software that can learn how to classify data items into groups.
The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination, and it then encodes these parameters into a model called a classifier.
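The two phases can be sketched with a minimal nearest-centroid classifier. The feature vectors and the "LL"/"HL" labels below are made up for illustration:

```python
# Minimal sketch of the two-phase classification process: a "learning" phase
# that estimates one centroid per class from pre-classified examples, and a
# "classification" phase that assigns new items to the nearest centroid.
# The data and labels are illustrative only.

training = [([1.0, 1.2], "LL"), ([0.8, 1.0], "LL"),
            ([3.0, 3.1], "HL"), ([3.2, 2.9], "HL")]

# Learning: average the feature vectors of each class (the "parameters").
centroids = {}
for features, label in training:
    sums, count = centroids.get(label, ([0.0, 0.0], 0))
    centroids[label] = ([s + f for s, f in zip(sums, features)], count + 1)
centroids = {lab: [s / n for s in sums] for lab, (sums, n) in centroids.items()}

# Classification: pick the class whose centroid is closest (squared distance).
def classify(item):
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(centroids[lab], item)))

print(classify([0.9, 1.1]))  # -> LL
print(classify([3.1, 3.0]))  # -> HL
```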
ii. Clustering Analysis
A cluster is a collection of data objects that are similar within the same cluster: the objects are similar to one another within the same group and dissimilar or unrelated to the objects in other groups or clusters. Clustering analysis is the process of discovering groups and clusters in the data in such a way that the degree of association between two objects is highest if they belong to the same group and lowest otherwise. The results of this analysis can be used, for example, to create customer profiles.
In the clustering technique the classes are not known in advance but emerge as objects are grouped, whereas in classification objects are assigned to predefined classes.
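A simple 2-means pass illustrates how groups emerge from the data itself, without predefined classes. The one-dimensional values are illustrative only:

```python
# Sketch of a simple clustering pass (2-means) in plain Python: objects in the
# same cluster end up close to one another, objects in different clusters far
# apart. The values are made up for illustration.

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.8]
centers = [points[0], points[3]]           # naive initialisation

for _ in range(10):                        # a few refinement iterations
    clusters = [[], []]
    for p in points:                       # assign each point to nearest center
        idx = min((abs(p - c), i) for i, c in enumerate(centers))[1]
        clusters[idx].append(p)
    centers = [sum(c) / len(c) for c in clusters]  # recompute the means

print([round(c, 2) for c in centers])  # -> [1.03, 8.03]
print(clusters)                        # -> [[1.0, 1.2, 0.9], [8.0, 8.3, 7.8]]
```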
iii. Regression Analysis
In statistical terms, regression analysis is the process of identifying and analyzing relationships among variables. It can help you understand how the value of the dependent variable changes when any one of the independent variables is varied: one variable depends on the others, but not vice versa. It is generally used for prediction and forecasting.
Regression analysis is widely used for prediction and forecasting, where its use overlaps substantially with the field of machine learning. It is also used to understand which independent variables are related to the dependent variable, and to explore the forms of these relationships.
In data mining, independent variables are attributes already known and response variables
are what we want to predict. Real-world problems are very difficult to predict because they
may depend on complex interactions of multiple predictor variables. Therefore, more
complex techniques (e.g. logistic regression, decision trees, or neural nets) may be
necessary to forecast future values. The same model types can often be used for both
regression and classification. For example, the CART (Classification and Regression
Trees) decision tree algorithm can be used to build both classification trees (to classify
categorical response variables) and regression trees (to forecast continuous response
variables). Neural networks too can create both classification and regression models.
Common regression variants include:
▪ Linear Regression
▪ Multivariate Linear Regression
▪ Nonlinear Regression
▪ Multivariate Nonlinear Regression
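The simplest of these variants, linear regression, can be fitted by least squares in a few lines. The data points below are made up and roughly follow y = 2x:

```python
# Minimal least-squares fit of a line y = a*x + b, the simplest regression
# variant listed above. The data points are illustrative only.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept from the means
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))   # -> 1.96 0.15
print(a * 5 + b)                  # predicted y at x = 5 (close to 10)
```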
iv. Association Rules
Association rule mining refers to methods that identify interesting relations (dependency modeling) between different variables in large databases. This technique can unpack hidden patterns in the data, identifying variables and combinations of variables that appear together very frequently in the dataset. Association rules are useful for examining and forecasting customer behavior, and the technique is used in shopping-basket data analysis, product clustering, catalog design, and store layout. In IT, programmers use association rules to build programs capable of machine learning.
Association and correlation analysis is usually applied to find frequent itemsets in large data sets. These findings help businesses make certain decisions, such as catalogue design, cross-marketing, and customer shopping behavior analysis. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two separate steps: first, minimum support is applied to find all frequent itemsets in the database; second, these frequent itemsets and the minimum confidence constraint are used to form rules. Association rule algorithms need to be able to generate rules with confidence values less than one. However, the number of possible association rules for a given dataset is generally very large, and a high proportion of the rules are usually of little (if any) value.
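The two-step procedure can be sketched directly. The transactions below are hypothetical "low-scoring standard" sets per school, not real SI records:

```python
# Sketch of the two-step association-rule procedure described above:
# (1) find itemsets meeting a minimum support, (2) keep rules meeting a
# minimum confidence. The transactions are hypothetical.

transactions = [{"std1_low", "std5_low"},
                {"std1_low", "std5_low", "std9_low"},
                {"std1_low"},
                {"std5_low"},
                {"std1_low", "std5_low"}]

def count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

pair = {"std1_low", "std5_low"}

# Step 1: support of the itemset (fraction of transactions containing it).
support = count(pair) / len(transactions)        # 3 of 5 -> 0.6

# Step 2: confidence of the rule std1_low -> std5_low.
confidence = count(pair) / count({"std1_low"})   # 3 of 4 -> 0.75

print(support, confidence)
```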
v. Neural Networks
An artificial neural network (ANN) is a computational model inspired by the way biological neural networks process information. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modelling tools, usually used to model complex relationships between inputs and outputs or to find patterns in data.
A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. They are well suited for continuous-valued inputs and outputs, and are best at identifying patterns or trends in data, making them well suited for prediction and forecasting needs.
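The weight-adjustment idea can be sketched with a single neuron (a perceptron) trained on an AND-style toy function; the data and learning rate are illustrative, not part of the SI experiments:

```python
# Tiny sketch of the learning rule described above: a single neuron adjusts
# its connection weights until it predicts the correct class label for each
# input tuple. The AND-style data are made up for illustration.

samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, bias, rate = [0.0, 0.0], 0.0, 0.1

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + bias > 0 else 0

for _ in range(20):                      # repeated passes over the data
    for x, target in samples:
        error = target - predict(x)      # 0 when the label is already correct
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        bias += rate * error

print([predict(x) for x, _ in samples])  # -> [0, 0, 0, 1]
```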
Anomaly detection refers to the observation of data items in a dataset that do not match an expected pattern or behavior. Anomalies are also known as outliers, novelties, noise, deviations, and exceptions; often they provide critical and actionable information. An anomaly is an item that deviates considerably from the common average within a dataset or a combination of data. Such items are statistically distant from the rest of the data, indicating that something out of the ordinary has happened and requires additional attention. This technique is used in a variety of domains, such as intrusion detection, system health monitoring, fraud detection, fault detection, event detection in sensor networks, and detecting ecosystem disturbances. Analysts often remove anomalous data from the dataset to obtain results with increased accuracy.
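The "statistically distant" idea can be made concrete with a simple z-score rule; the values and the two-standard-deviation cut-off are illustrative choices:

```python
# Sketch of anomaly detection by statistical distance: flag items more than
# two standard deviations from the dataset mean. The values are illustrative,
# and the 2-sigma cut-off is one common (but arbitrary) choice.

import statistics

values = [48, 50, 52, 49, 51, 50, 95]          # 95 is the planted outlier
mean = statistics.mean(values)
stdev = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)   # -> [95]
```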
Organizations that apply data mining include retail stores, hospitals, banks, and insurance companies. Many of them combine data mining with statistics, pattern recognition, and other important tools. Data mining can find patterns and connections that would otherwise be difficult to find, and it is popular with many businesses because it allows them to learn more about their customers and make smart marketing decisions.
Data mining has a number of applications. The first is market segmentation: finding behaviors that are common among customers, such as patterns among customers who tend to purchase the same products at the same time. Another application is customer churn analysis, which estimates which customers are most likely to stop purchasing our products or services and go to a competitor. In addition, a company can use data mining to find out which purchases are most likely to be fraudulent. For example, by using data mining in retail stores we may be able to determine which products are stolen the most; knowing this, steps can be taken to protect those products and detect those who are stealing them. We can also use data mining to determine the effectiveness of interactive marketing: some customers will be more likely to purchase products online than offline, and we must identify them.
While many businesses use data mining to help increase their profits, it can also be used to create new businesses and industries. One such industry is the automatic prediction of behaviors and trends. Using automated prediction we can gain an advantage over the competition: instead of simply guessing what the next big trend will be, we can determine it based on statistics, patterns, and logic. Another application of automatic prediction is to look at past marketing strategies to determine the best one so far and the reasons for its success, so that the mistakes of previous marketing campaigns can be avoided. Data mining is also a powerful tool for those who deal with finances: a financial institution such as a bank can predict the number of defaults that will occur among its customers within a given period of time, as well as the amount of fraud that will occur.
Another application of data mining is the automatic recognition of patterns that were not previously known. While data mining is a very valuable tool, it is important to realize that it is not a complete solution: even a fully automated technology will not guarantee the success of the company. However, it will tip the odds in our favor.
Typical applications in the retail industry include:
▪ Design and construction of data warehouses based on the benefits of data mining
▪ Multidimensional analysis of sales, customers, products, time, and region
▪ Analysis of the effectiveness of sales campaigns
▪ Customer retention—analysis of customer loyalty
▪ Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
Data collection and storage technologies have recently improved, so that today scientific data can be amassed at much higher speeds and lower costs. This has resulted in the accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data containing rich spatial and temporal information. Consequently, scientific applications are shifting from the "hypothesize-and-test" paradigm toward a "collect and store data, mine for new hypotheses, confirm with data or experimentation" process. This shift brings new challenges for data mining.
A spatial database stores a large amount of space-related data, such as maps, pre-processed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, but data mining now offers great potential benefits for GIS-based applied decision-making.
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
▪ A non-spatial dimension
▪ A spatial-to-non-spatial dimension
▪ A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
▪ A numerical measure contains only numerical data
▪ A spatial measure contains a collection of pointers to spatial objects
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains
of protein molecules.
In general data visualization and data mining can be integrated in the following ways:
▪ Data visualization
▪ Data mining result visualization
▪ Data mining process visualization
▪ Interactive visual data mining
Challenges of mining the Web:
▪ The Web seems to be too huge for effective data warehousing and data mining
▪ The complexity of Web pages is far greater than that of any traditional text
document collection
▪ The Web is a highly dynamic information source
▪ The Web serves a broad diversity of user communities
▪ Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web
mining is Web usage mining.
Data mining system products and research prototypes should be assessed on multiple features: data types, system issues, data sources, data mining functions and methodologies, coupling of data mining with database and/or data warehouse systems, scalability, visualization tools, and the data mining query language and graphical user interface.
6. Methods
6.1. Machine Learning ML
There are three types of machine learning: supervised, unsupervised, and reinforcement learning.
Table 1: Types of machine learning

Supervised learning
- Types of problems: classification, regression
- Training: on labeled examples (inputs with known outputs)
- The aim: predict the output for new, unseen inputs
- Feedback: direct (the correct answer is given during training)
- Popular algorithms: decision trees, naive Bayes, k-nearest neighbors, neural networks
- Applications: spam filtering, credit scoring, classifying schools by level

Unsupervised learning
- Types of problems: clustering, association, dimensionality reduction
- Training: on unlabeled data
- The aim: discover hidden structure or groups in the data
- Feedback: none (no correct answers are provided)
- Popular algorithms: k-means, EM, Apriori, self-organizing maps
- Applications: customer segmentation, market-basket analysis

Reinforcement learning
- Types of problems: sequential decision making
- Training: by interacting with an environment
- The aim: learn actions that maximize a cumulative reward
- Feedback: delayed reward or penalty signals
- Popular algorithms: Q-learning, temporal-difference learning
- Applications: robotics, game playing
To use all the remaining functions, I convert the data into the ".arff" format required by the software.
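A minimal sketch of such a conversion is shown below. The relation name, attribute names, and rows are placeholders, not the real SI schema:

```python
# Sketch of converting tabular records into WEKA's ".arff" text format:
# a header of @relation/@attribute declarations followed by @data rows.
# The relation and attribute names here are hypothetical placeholders.

rows = [("schoolA", 72.5, "HL"), ("schoolB", 41.0, "LL")]

header = "\n".join([
    "@relation school_inspection",
    "@attribute school string",
    "@attribute std1 numeric",
    "@attribute level {VLL,LL,HL,VHL}",   # nominal attribute with its values
    "@data",
])
data = "\n".join(",".join(str(v) for v in row) for row in rows)
arff = header + "\n" + data
print(arff)
```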
Features:
I create an instance of the relevant class to execute it. The functionality of WEKA is organized according to the steps of machine learning. The Classifiers class prints out a decision tree classifier for the dataset given as input, and a ten-fold cross-validation estimate of its performance is also calculated. The Classifiers package implements the most common techniques separately for categorical and numerical values.
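The ten-fold cross-validation estimate works as sketched below; this is an illustrative fold construction on placeholder data, not WEKA's stratified implementation:

```python
# Illustrative sketch of ten-fold cross-validation: the dataset is split into
# 10 disjoint folds, and each fold serves once as the test set while the
# remaining 9 folds form the training set. The 30 numbered "instances" are
# placeholders, not real SI records.

dataset = list(range(30))
k = 10

folds = [dataset[i::k] for i in range(k)]    # 10 disjoint folds of 3 each

for test_fold in folds:
    train = [x for x in dataset if x not in test_fold]
    # a classifier would be trained on `train` and scored on `test_fold`;
    # the 10 accuracy scores are then averaged into one estimate
    assert len(train) + len(test_fold) == len(dataset)

print(len(folds), [len(f) for f in folds])   # -> 10 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```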
Performance evaluation:
▪ Classifiers.trees.J48 -C 0.25 -M 2, using a 90.0% train / remainder test split, classifies 247 instances correctly and 37 incorrectly, for 87% accuracy.
▪ Time taken to test model on test split: 0.23 seconds
▪ Correctly Classified Instances 247 86.97 %
▪ Incorrectly Classified Instances 37 13.03 %
Performance evaluation:
▪ Bayes.BayesNet, using a 90.0% train / remainder test split, classifies 258 instances correctly and 26 incorrectly, for 90.85% accuracy.
▪ Time taken to test model on test split: 0.16 seconds
▪ Correctly Classified Instances 258 90.85 %
▪ Incorrectly Classified Instances 26 9.15 %
Performance evaluation:
▪ Lazy.IBk, using a 90.0% train / remainder test split, classifies 261 instances correctly and 23 incorrectly, for 91.9% accuracy.
▪ Time taken to test model on test split: 0.0 seconds
▪ Correctly Classified Instances 261 91.9 %
▪ Incorrectly Classified Instances 23 8.1 %
From the above test results we can see that Lazy.IBk, with the 90.0% train / remainder test split, is the best algorithm for our dataset.
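Lazy.IBk is WEKA's k-nearest-neighbour classifier. The idea behind it, together with the 90%/10% split evaluation, can be sketched in plain Python; the points and labels below are made up, not the actual SI dataset:

```python
# Minimal 1-nearest-neighbour sketch of the idea behind Lazy.IBk: hold out
# 10% of the (made-up) data as a test split, label each held-out item with
# the label of its closest training item, and report the accuracy.

data = [((i, i + 1), "L") for i in range(10)] + \
       [((i + 10, i + 11), "H") for i in range(10)]

split = int(len(data) * 0.9)          # 90% train, remainder test
train, test = data[:split], data[split:]

def nearest_label(x):
    """Label of the training item closest to x (squared distance)."""
    return min(train, key=lambda item: sum((a - b) ** 2
                                           for a, b in zip(item[0], x)))[1]

correct = sum(nearest_label(x) == label for x, label in test)
print(correct / len(test))            # -> 1.0 on this separable toy data
```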
Visualizing errors
With the Lazy.IBk algorithm (unsupervised data setting, 90.0% train / remainder test split) there are 23 errors: instances classified under a different level than the one they should belong to (highlighted with a yellow mark in the error visualization).
6.4. Clustering:
The Clusterers.EM clustering algorithm is used for the SI data clustering demonstration, with ten-fold cross-validation and 10 bins using equal frequency; an estimate of its performance is also calculated.
Ignored attribute: GRADE
Test mode: Evaluate on training data
Association rule mining finds interesting association or correlation relationships among sets of data items. With massive amounts of SI data continuously being collected and stored in Excel, SNNPR EB is becoming interested in mining association rules from its databases. For example, the discovery of interesting association relationships among huge numbers of SI records can help with standards categorization, cross-school performance analysis, resource distribution, and other management decision-making processes.
Consider association rule mining for standards progress analysis. This process analyzes school performance patterns by finding associations between the different standards on which schools are rated every year. The discovery of such associations can help the EB develop strategies by giving insight into which standards frequently fall below the required level together at the same school. For instance, if a school falls below the threshold on one of the input standards, say standard-1 = L, how likely is it to also fall below on a process standard (and which standard)? Such information can lead to improvement in the schools' standards.
7. Conclusion
This seminar report provided an overview of Data Mining Process, its techniques and
applications. The following conclusions can be drawn:
I. Data Mining is a crucial step in the Knowledge Discovery in Databases Process but can
only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection, and transformation, the functions and techniques are only applied in the vital step where intelligent methods are used to detect patterns.
III. A model for Data Mining is useful for a company or a data mining practitioner as it
helps in adapting a result oriented approach.
IV. Cross Industry Standard Process for Data Mining Model is an effective approach to a
model which considers business requirements at every step.
V. Classification and clustering techniques are popular and easily applicable in data mining; however, classification requires prior characteristic (class) information.
VI. Artificial neural networks can be deployed to detect patterns and make predictions, which makes them capable tools in data mining. A feed-forward neural network uses a back-propagation algorithm to train itself.
VII. The application of data mining techniques along with GIS techniques makes for a
potential opportunity to explore various aspects of Spatial Data Mining.
VIII. The growth of data available for processing, as well as multimedia elements and the World Wide Web, leads to greater opportunities for data mining techniques. However, the pre-processing, selection, and transformation steps need to be handled first.
In this seminar I have used the WEKA framework for data mining on SNNPR EB SI data. I tested the dataset with different classification algorithms and compared their performance on our dataset. I found that Lazy.IBk, with a 90.0% train / remainder test split, is the best-performing algorithm for developing the application. I also found the 10 best association rules, which help with decision making and with gaining knowledge about the associations between standards.
8. References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8:866-883, 1996.
[9] Portia A. Cerny. Data mining and neural networks from a commercial perspective.
[10] Bharati M. Ramageri. Data mining techniques and applications; Yashpal Singh and Alok Singh Chauhan. Neural networks in data mining.