For Seminar Presentation-Edited (Feb5)
Table of Contents
1. Abstract
2. Introduction
3. Statement of the problem
4. Objective
4.1. General Objective
5. Data mining
6. Methods
6.1. Machine Learning (ML)
7. Conclusion
8. References
1. Abstract
This report presents an overview of data mining techniques and shows some applications of Artificial Neural Network (ANN) based Data Mining (DM) techniques to School Inspection (SI) data of the Southern Nations, Nationalities, and Peoples' Region (SNNPR) Education Bureau (EB) schools. The bureau conducts SI every year against twenty-six (26) clearly stated standards. The standards address the jobs of different departments, and the inspection results are delivered to the departments accordingly. The major challenge of SI is extracting enough knowledge to support major decisions and to help the departments understand each school's level. Many technologies are available to data mining practitioners, including artificial neural networks, regression, and decision trees. In this seminar I show an application of data mining techniques to monitor SI. The report also provides a brief overview of artificial neural networks and examines their position as an applicable tool in data mining.
Keywords: data mining, machine learning, artificial neural network, knowledge discovery, school inspection, supervised vs. unsupervised learning
2. Introduction
The development of information technology has generated large numbers of databases and huge volumes of data in many areas. Corporations and organizations are now accumulating data at an enormous rate and from a very broad variety of sources, and many relational database servers have been built to store these massive quantities of data. The data itself is critical to a company's growth: it contains knowledge that could lead to important business decisions and take the business to the next level. Yet these data are often examined only in a superficial manner, leaving organizations data rich but knowledge poor. In other words, "We are drowning in data, but starving for knowledge!"
We need information, but what we have is a huge amount of data flooding companies, organizations, and even individuals. Because the amount of data is so enormous that humans cannot process it fast enough to extract information at the right time, machine learning technology has been developed to address this problem. Research in databases and information technology has given rise to approaches for storing and manipulating this precious data for further decision making.
Data mining is the term used to describe the process of extracting value from a database. A data warehouse is a location where data is stored; the type of data stored depends largely on the industry and the company. Data mining, the analysis step of the "Knowledge Discovery in Databases" (KDD) process, is a relatively young and interdisciplinary field of computer science that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Apart from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Data mining is the business of answering questions that you have not yet asked; it reaches deep into databases. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive data mining provides information for understanding what is happening inside the data, without a predetermined idea. Predictive data mining allows the user to submit records with unknown field values, which are predicted from the patterns previously discovered in the database. Data mining models can be categorized according to the tasks they perform: classification and prediction, clustering, and association rules. Classification and prediction are predictive models, while clustering and association rules are descriptive models.
The most common action in data mining is classification. It recognizes patterns that describe the group to which an item belongs, by examining items that have already been classified and inferring a set of rules. Clustering is similar to classification, the major difference being that no groups have been predefined. Prediction is the construction and use of a model to assess the class of an unlabeled object, or the value or value range that a given object is likely to have. The next application is forecasting, which differs from prediction in that it estimates the future value of continuous variables based on patterns within the data.
Four things are required to data-mine effectively: high-quality data, the “right” data, an
adequate sample size and the right tool. There are many tools available to a data mining
practitioner. These include decision trees, various types of regression and neural networks.
3. Statement of the problem
Strengthening an education system means aligning its governance, management, financing, and performance incentive mechanisms to produce quality education for all. The existence of good information about the process of teaching and learning contributes greatly to the quality provision of education, enhancing learning outcomes by improving school performance.
One of the best tools for assuring quality of education is SI. It measures each school's performance and places it in one of four categories: very low level (VLL), low level (LL), high level (HL), and very high level (VHL). The overall school performance result is obtained from three data subcategories. The first is the input category, which measures the resources schools use as inputs and has seven (7) standards. The second is the process category, which measures the activities a school carries out and comprises fourteen (14) standards. The third is the output category, which measures the results schools record and comprises five (5) standards. The sum of these three gives the total performance level of the school.
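As a minimal sketch of how the three category scores might be combined into an overall level, consider the following. The equal weighting and the cut-off thresholds below are hypothetical placeholders for illustration, not the bureau's official values:

```python
# Sketch: combining the three SI data categories into an overall school level.
# The averaging scheme and the level thresholds are hypothetical placeholders,
# not the bureau's official scoring rules.

def overall_level(input_score, process_score, output_score):
    """Combine the three category scores (each 0-100) and map to a level."""
    total = (input_score + process_score + output_score) / 3  # simple average
    if total < 50:
        return "VLL"   # very low level
    elif total < 70:
        return "LL"    # low level
    elif total < 85:
        return "HL"    # high level
    return "VHL"       # very high level

print(overall_level(40, 60, 55))  # total is about 51.7 -> "LL"
```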
SNNPR EB performs SI every year, collects school data for analysis, and delivers the performance results to the departments the standards belong to. The departments then use the information gained from the data for decision making and for planning to improve the schools that fall below the minimum standard. Nowadays, however, the knowledge gained from the raw data alone is insufficient and can mislead the bureau in some decisions. The data must be mined in order to investigate the associations among the schools' results on the different standards.
In this seminar I present data mining work on the 2010 SNNPR SI data using several data mining techniques: preparing the data (data preprocessing), removing unimportant attributes, converting the data into a better form without altering its content, discretizing the data, testing several algorithms and selecting the best one for mining, and filtering with different algorithms while comparing their performance on the SI data. For the demonstration I use the WEKA framework, applying it for attribute selection, clustering, association rules, filtering, and estimation. Finally, I prepare some suggestions for school-level improvement based on the data analysis.
4. Objective
4.1. General Objective
The objective of this seminar is to discover knowledge using artificial neural network based
data mining techniques which are applicable in various scenarios.
5. Data mining
The process of searching and analyzing large amounts of data is called ‘‘data mining’’. The
large collections of data are the potential lodes of valuable information but like in real
mining, the search and extraction can be a difficult and exhaustive process. Data Mining is
a knowledge discovery process of extracting previously unknown, actionable information
from very large databases. In details it is the non-trivial extraction of implicit, previously
unknown and potentially useful information from data. In other words, it is the search from
relationships and global patterns that exist in large databases, but are ‘‘hidden’’ among the
vast amounts of data. These relationships represent valuable knowledge about the database
and objects in the world.
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation
Data mining can be performed on many kinds of data repositories, including:
▪ Relational databases
▪ Data warehouses
▪ Transactional databases
▪ Advanced DB and information repositories
▪ Object-oriented and object-relational databases
▪ Spatial databases
▪ Time-series data and temporal data
▪ Text databases and multimedia databases
▪ Heterogeneous and legacy databases
The KDD process typically involves the following steps:
▪ Data cleaning: The task of this step is to remove noise and inconsistent data.
▪ Data integration: In this step, multiple data sources like the ones mentioned in the
section above can be combined to an integrated collection of data.
▪ Data selection: All the data relevant to the analysis task is retrieved from the
database in this step.
▪ Data transformation: The data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
▪ Data mining: The critical step where intelligent methods are applied in order to
extract data patterns.
▪ Pattern evaluation: This step is deployed to identify the truly interesting patterns
representing knowledge based on certain measures.
▪ Knowledge presentation: In the final step, various visualization and knowledge
representation techniques are used to present the mined knowledge to the user.
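The pre-mining steps above can be illustrated with a toy example. The records below are made up for illustration; the attribute names do not come from the real SI schema:

```python
# Toy illustration of the pre-mining KDD steps (cleaning, selection,
# transformation) on a few hypothetical school records.

records = [
    {"school": "A", "std1": 80, "std2": None},   # missing value -> noisy row
    {"school": "B", "std1": 65, "std2": 70},
    {"school": "B", "std1": 65, "std2": 70},     # duplicate row
    {"school": "C", "std1": 90, "std2": 85},
]

# Data cleaning: drop records with missing values, and drop duplicates.
seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if None not in r.values() and key not in seen:
        seen.add(key)
        cleaned.append(r)

# Data selection: keep only the attributes relevant to the analysis task.
selected = [{"std1": r["std1"], "std2": r["std2"]} for r in cleaned]

# Data transformation: aggregate each record into a single summary value.
transformed = [(r["std1"] + r["std2"]) / 2 for r in selected]
print(transformed)  # -> [67.5, 87.5]
```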
Data mining has five main functions, described in the sections that follow: classification, clustering, regression, association rules, and neural networks.
The model derivation stage focuses on choosing learning samples, testing samples and
learning algorithms. Due to the large volume of available data, data mining may be done
on subsets of the data from the data warehouse. An appropriate data sample is selected
from the data in the warehouse and is checked for descriptiveness. This process may have
to iterate a few times before a suitable sample set can be selected. The selected sample
dataset forms the training data for the data-mining algorithm. The data-mining process is
viewed in our framework as the derivation of an appropriate knowledge model of the
patterns in the data that are interesting to the user. The algorithm for model derivation,
together with the guidance provided by the user, will generally produce several models of
the information contained in the data. The data-mining algorithms use guidance from the
analyst to decide various parameters of the model being learned from the data, such as its
accuracy and prevalence, and to control the computational complexity of the learning
process.
Among all the models generated, the users may only select a few interesting models to be
included in their applications. The usage and maintenance phase is concerned with
monitoring of database updates and continued validation of patterns learned in the past.
Even though the learning process may have user guidance, not all the knowledge models generated will have business applications. Only the interesting models are selected and applied in performing business tasks. Another important task in this stage of the life cycle is
to continuously monitor the validity of the knowledge models in the context of changes to
data in the warehouse. When the population in the warehouse shifts significantly, the
previously learned models will no longer be applicable, and new models will have to be
derived. We may also be able to learn new models incrementally from the new data.
INFOLINK U.COLLEGE, Data Mining Seminar: Artificial Neural Network Based Data Mining Techniques

i. Classification Analysis
This analysis is used to retrieve important and relevant information about data and metadata. It classifies data into different classes. Classification is similar to clustering in that it also segments data records into groups, called classes; unlike clustering, however, the analyst already knows the classes or clusters in advance. In classification analysis you therefore apply algorithms to decide how new data should be classified. Fraud detection and credit-risk applications are particularly well suited to this type of analysis. This approach frequently employs decision-tree or neural-network-based classification algorithms.
The data classification process involves learning and classification. In learning, the training data are analyzed by a classification algorithm. In classification, test data are used to estimate the accuracy of the classification rules; if the accuracy is acceptable, the rules can be applied to new data tuples. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we build software that can learn how to classify data items into groups.
The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination, and it then encodes these parameters into a model called a classifier.
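The two phases can be sketched with a minimal nearest-centroid classifier. The feature vectors and the "LL"/"HL" labels below are made up for illustration:

```python
# Minimal sketch of the two-phase classification process: a "learning" phase
# that estimates one centroid per class from pre-classified examples, and a
# "classification" phase that assigns new items to the nearest centroid.
# The data and labels are illustrative only.

training = [([1.0, 1.2], "LL"), ([0.8, 1.0], "LL"),
            ([3.0, 3.1], "HL"), ([3.2, 2.9], "HL")]

# Learning: average the feature vectors of each class (the "parameters").
centroids = {}
for features, label in training:
    sums, count = centroids.get(label, ([0.0, 0.0], 0))
    centroids[label] = ([s + f for s, f in zip(sums, features)], count + 1)
centroids = {lab: [s / n for s in sums] for lab, (sums, n) in centroids.items()}

# Classification: pick the class whose centroid is closest (squared distance).
def classify(item):
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(centroids[lab], item)))

print(classify([0.9, 1.1]))  # -> LL
print(classify([3.1, 3.0]))  # -> HL
```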
ii. Clustering Analysis
A cluster is a collection of data objects that are similar within the same cluster: the objects are similar to one another within the same group and dissimilar or unrelated to the objects in other groups or clusters. Clustering analysis is the process of discovering groups and clusters in the data in such a way that the degree of association between two objects is highest if they belong to the same group and lowest otherwise. The results of this analysis can be used, for example, to create customer profiles.
In the clustering technique the classes are not known in advance but emerge as objects are grouped, whereas in classification objects are assigned to predefined classes.
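A simple 2-means pass illustrates how groups emerge from the data itself, without predefined classes. The one-dimensional values are illustrative only:

```python
# Sketch of a simple clustering pass (2-means) in plain Python: objects in the
# same cluster end up close to one another, objects in different clusters far
# apart. The values are made up for illustration.

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.8]
centers = [points[0], points[3]]           # naive initialisation

for _ in range(10):                        # a few refinement iterations
    clusters = [[], []]
    for p in points:                       # assign each point to nearest center
        idx = min((abs(p - c), i) for i, c in enumerate(centers))[1]
        clusters[idx].append(p)
    centers = [sum(c) / len(c) for c in clusters]  # recompute the means

print([round(c, 2) for c in centers])  # -> [1.03, 8.03]
print(clusters)                        # -> [[1.0, 1.2, 0.9], [8.0, 8.3, 7.8]]
```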
iii. Regression Analysis
In statistical terms, regression analysis is the process of identifying and analyzing relationships among variables. It can help you understand how the value of the dependent variable changes when any one of the independent variables is varied: one variable depends on the others, but not vice versa. It is generally used for prediction and forecasting.
Regression analysis is widely used for prediction and forecasting, where its use overlaps substantially with the field of machine learning. It is also used to understand which independent variables are related to the dependent variable, and to explore the forms of these relationships.
In data mining, independent variables are attributes already known and response variables
are what we want to predict. Real-world problems are very difficult to predict because they
may depend on complex interactions of multiple predictor variables. Therefore, more
complex techniques (e.g. logistic regression, decision trees, or neural nets) may be
necessary to forecast future values. The same model types can often be used for both
regression and classification. For example, the CART (Classification and Regression
Trees) decision tree algorithm can be used to build both classification trees (to classify
categorical response variables) and regression trees (to forecast continuous response
variables). Neural networks too can create both classification and regression models.
Common regression variants include:
▪ Linear Regression
▪ Multivariate Linear Regression
▪ Nonlinear Regression
▪ Multivariate Nonlinear Regression
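The simplest of these variants, linear regression, can be fitted by least squares in a few lines. The data points below are made up and roughly follow y = 2x:

```python
# Minimal least-squares fit of a line y = a*x + b, the simplest regression
# variant listed above. The data points are illustrative only.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept from the means
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))   # -> 1.96 0.15
print(a * 5 + b)                  # predicted y at x = 5 (close to 10)
```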
iv. Association Rules
Association rule mining refers to methods that identify interesting relations (dependency modeling) between different variables in large databases. This technique can unpack hidden patterns in the data, identifying variables and combinations of variables that appear together very frequently in the dataset. Association rules are useful for examining and forecasting customer behavior, and the technique is used in shopping-basket data analysis, product clustering, catalog design, and store layout. In IT, programmers use association rules to build programs capable of machine learning.
Association and correlation analysis is usually applied to find frequent itemsets in large data sets. These findings help businesses make certain decisions, such as catalogue design, cross-marketing, and customer shopping behavior analysis. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two separate steps: first, minimum support is applied to find all frequent itemsets in the database; second, these frequent itemsets and the minimum confidence constraint are used to form rules. Association rule algorithms need to be able to generate rules with confidence values less than one. However, the number of possible association rules for a given dataset is generally very large, and a high proportion of the rules are usually of little (if any) value.
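The two-step procedure can be sketched directly. The transactions below are hypothetical "low-scoring standard" sets per school, not real SI records:

```python
# Sketch of the two-step association-rule procedure described above:
# (1) find itemsets meeting a minimum support, (2) keep rules meeting a
# minimum confidence. The transactions are hypothetical.

transactions = [{"std1_low", "std5_low"},
                {"std1_low", "std5_low", "std9_low"},
                {"std1_low"},
                {"std5_low"},
                {"std1_low", "std5_low"}]

def count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

pair = {"std1_low", "std5_low"}

# Step 1: support of the itemset (fraction of transactions containing it).
support = count(pair) / len(transactions)        # 3 of 5 -> 0.6

# Step 2: confidence of the rule std1_low -> std5_low.
confidence = count(pair) / count({"std1_low"})   # 3 of 4 -> 0.75

print(support, confidence)
```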
v. Neural Networks
An artificial neural network (ANN) is a computational model inspired by the way biological neural networks process information. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modelling tools, usually used to model complex relationships between inputs and outputs or to find patterns in data.
A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. They are well suited for continuous-valued inputs and outputs, and are best at identifying patterns or trends in data, making them well suited for prediction and forecasting needs.
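The weight-adjustment idea can be sketched with a single neuron (a perceptron) trained on an AND-style toy function; the data and learning rate are illustrative, not part of the SI experiments:

```python
# Tiny sketch of the learning rule described above: a single neuron adjusts
# its connection weights until it predicts the correct class label for each
# input tuple. The AND-style data are made up for illustration.

samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, bias, rate = [0.0, 0.0], 0.0, 0.1

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + bias > 0 else 0

for _ in range(20):                      # repeated passes over the data
    for x, target in samples:
        error = target - predict(x)      # 0 when the label is already correct
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        bias += rate * error

print([predict(x) for x, _ in samples])  # -> [0, 0, 0, 1]
```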
Anomaly detection refers to the observation of data items in a dataset that do not match an expected pattern or behavior. Anomalies are also known as outliers, novelties, noise, deviations, and exceptions; often they provide critical and actionable information. An anomaly is an item that deviates considerably from the common average within a dataset or a combination of data. Such items are statistically distant from the rest of the data, indicating that something out of the ordinary has happened and requires additional attention. This technique is used in a variety of domains, such as intrusion detection, system health monitoring, fraud detection, fault detection, event detection in sensor networks, and detecting ecosystem disturbances. Analysts often remove anomalous data from the dataset to obtain results with increased accuracy.
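The "statistically distant" idea can be made concrete with a simple z-score rule; the values and the two-standard-deviation cut-off are illustrative choices:

```python
# Sketch of anomaly detection by statistical distance: flag items more than
# two standard deviations from the dataset mean. The values are illustrative,
# and the 2-sigma cut-off is one common (but arbitrary) choice.

import statistics

values = [48, 50, 52, 49, 51, 50, 95]          # 95 is the planted outlier
mean = statistics.mean(values)
stdev = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)   # -> [95]
```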
Organizations that apply data mining include retail stores, hospitals, banks, and insurance companies. Many of them combine data mining with statistics, pattern recognition, and other important tools. Data mining can find patterns and connections that would otherwise be difficult to find, and it is popular with many businesses because it allows them to learn more about their customers and make smart marketing decisions.
Data mining has a number of applications. The first is market segmentation: finding behaviors that are common among customers, such as patterns among customers who tend to purchase the same products at the same time. Another application is customer churn analysis, which estimates which customers are most likely to stop purchasing our products or services and go to a competitor. In addition, a company can use data mining to find out which purchases are most likely to be fraudulent. For example, by using data mining in retail stores we may be able to determine which products are stolen the most; knowing this, steps can be taken to protect those products and detect those who are stealing them. We can also use data mining to determine the effectiveness of interactive marketing: some customers will be more likely to purchase products online than offline, and we must identify them.
While many businesses use data mining to help increase their profits, it can also be used to create new businesses and industries. One such industry is the automatic prediction of behaviors and trends. Using automated prediction we can gain an advantage over the competition: instead of simply guessing what the next big trend will be, we can determine it based on statistics, patterns, and logic. Another application of automatic prediction is to look at past marketing strategies to determine the best one so far and the reasons for its success, so that the mistakes of previous marketing campaigns can be avoided. Data mining is also a powerful tool for those who deal with finances: a financial institution such as a bank can predict the number of defaults that will occur among its customers within a given period of time, as well as the amount of fraud that will occur.
Another application of data mining is the automatic recognition of patterns that were not previously known. While data mining is a very valuable tool, it is important to realize that it is not a complete solution: even a fully automated technology will not guarantee the success of the company. However, it will tip the odds in our favor.
Typical applications in the retail industry include:
▪ Design and construction of data warehouses based on the benefits of data mining
▪ Multidimensional analysis of sales, customers, products, time, and region
▪ Analysis of the effectiveness of sales campaigns
▪ Customer retention—analysis of customer loyalty
▪ Product recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry
Data collection and storage technologies have recently improved, so that today scientific data can be amassed at much higher speeds and lower costs. This has resulted in the accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data containing rich spatial and temporal information. Consequently, scientific applications are shifting from the "hypothesize-and-test" paradigm toward a "collect and store data, mine for new hypotheses, confirm with data or experimentation" process. This shift brings new challenges for data mining.
A spatial database stores a large amount of space-related data, such as maps, pre-processed
remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining
refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, but data mining now offers great potential benefits for GIS-based applied decision-making.
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
▪ A non-spatial dimension
▪ A spatial-to-non-spatial dimension
▪ A spatial-to-spatial dimension
We can distinguish two types of measures in a spatial data cube:
▪ A numerical measure contains only numerical data
▪ A spatial measure contains a collection of pointers to spatial objects
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains
of protein molecules.
In general data visualization and data mining can be integrated in the following ways:
▪ Data visualization
▪ Data mining result visualization
▪ Data mining process visualization
▪ Interactive visual data mining
Challenges of mining the Web:
▪ The Web seems to be too huge for effective data warehousing and data mining
▪ The complexity of Web pages is far greater than that of any traditional text
document collection
▪ The Web is a highly dynamic information source
▪ The Web serves a broad diversity of user communities
▪ Only a small portion of the information on the Web is truly relevant or useful
Besides mining Web contents and Web linkage structures, another important task for Web
mining is Web usage mining.
Data mining system products and research prototypes should be assessed on multiple features: data types, system issues, data sources, data mining functions and methodologies, coupling of data mining with database and/or data warehouse systems, scalability, visualization tools, and the data mining query language and graphical user interface.
6. Methods
6.1. Machine Learning ML
There are three types of machine learning: supervised, unsupervised, and reinforcement learning.
Table 1: Types of machine learning

Supervised learning
- Types of problems: classification, regression
- Training: on labeled examples (inputs with known outputs)
- The aim: predict the output for new, unseen inputs
- Feedback: direct (the correct answer is given during training)
- Popular algorithms: decision trees, naive Bayes, k-nearest neighbors, neural networks
- Applications: spam filtering, credit scoring, classifying schools by level

Unsupervised learning
- Types of problems: clustering, association, dimensionality reduction
- Training: on unlabeled data
- The aim: discover hidden structure or groups in the data
- Feedback: none (no correct answers are provided)
- Popular algorithms: k-means, EM, Apriori, self-organizing maps
- Applications: customer segmentation, market-basket analysis

Reinforcement learning
- Types of problems: sequential decision making
- Training: by interacting with an environment
- The aim: learn actions that maximize a cumulative reward
- Feedback: delayed reward or penalty signals
- Popular algorithms: Q-learning, temporal-difference learning
- Applications: robotics, game playing
To use all the remaining functions, I convert the data into the ".arff" format required by the software.
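A minimal sketch of such a conversion is shown below. The relation name, attribute names, and rows are placeholders, not the real SI schema:

```python
# Sketch of converting tabular records into WEKA's ".arff" text format:
# a header of @relation/@attribute declarations followed by @data rows.
# The relation and attribute names here are hypothetical placeholders.

rows = [("schoolA", 72.5, "HL"), ("schoolB", 41.0, "LL")]

header = "\n".join([
    "@relation school_inspection",
    "@attribute school string",
    "@attribute std1 numeric",
    "@attribute level {VLL,LL,HL,VHL}",   # nominal attribute with its values
    "@data",
])
data = "\n".join(",".join(str(v) for v in row) for row in rows)
arff = header + "\n" + data
print(arff)
```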
Features:
I create an instance of the relevant class to execute it. The functionality of WEKA is organized according to the steps of machine learning. The Classifiers class prints out a decision tree classifier for the dataset given as input, and a ten-fold cross-validation estimate of its performance is also calculated. The Classifiers package implements the most common techniques separately for categorical and numerical values.
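The ten-fold cross-validation estimate works as sketched below; this is an illustrative fold construction on placeholder data, not WEKA's stratified implementation:

```python
# Illustrative sketch of ten-fold cross-validation: the dataset is split into
# 10 disjoint folds, and each fold serves once as the test set while the
# remaining 9 folds form the training set. The 30 numbered "instances" are
# placeholders, not real SI records.

dataset = list(range(30))
k = 10

folds = [dataset[i::k] for i in range(k)]    # 10 disjoint folds of 3 each

for test_fold in folds:
    train = [x for x in dataset if x not in test_fold]
    # a classifier would be trained on `train` and scored on `test_fold`;
    # the 10 accuracy scores are then averaged into one estimate
    assert len(train) + len(test_fold) == len(dataset)

print(len(folds), [len(f) for f in folds])   # -> 10 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```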
Performance evaluation:
▪ Classifiers.trees.J48 -C 0.25 -M 2, using a 90.0% train / remainder test split, classifies 247 instances correctly and 37 incorrectly, for 87% accuracy.
▪ Time taken to test model on test split: 0.23 seconds
▪ Correctly Classified Instances 247 86.97 %
▪ Incorrectly Classified Instances 37 13.03 %
Performance evaluation:
▪ Bayes.BayesNet, using a 90.0% train / remainder test split, classifies 258 instances correctly and 26 incorrectly, for 90.85% accuracy.
▪ Time taken to test model on test split: 0.16 seconds
▪ Correctly Classified Instances 258 90.85 %
▪ Incorrectly Classified Instances 26 9.15 %
Performance evaluation:
▪ Lazy.IBk, using a 90.0% train / remainder test split, classifies 261 instances correctly and 23 incorrectly, for 91.9% accuracy.
▪ Time taken to test model on test split: 0.0 seconds
▪ Correctly Classified Instances 261 91.9 %
▪ Incorrectly Classified Instances 23 8.1 %
From the above test results we can see that Lazy.IBk, with the 90.0% train / remainder test split, is the best algorithm for our dataset.
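Lazy.IBk is WEKA's k-nearest-neighbour classifier. The idea behind it, together with the 90%/10% split evaluation, can be sketched in plain Python; the points and labels below are made up, not the actual SI dataset:

```python
# Minimal 1-nearest-neighbour sketch of the idea behind Lazy.IBk: hold out
# 10% of the (made-up) data as a test split, label each held-out item with
# the label of its closest training item, and report the accuracy.

data = [((i, i + 1), "L") for i in range(10)] + \
       [((i + 10, i + 11), "H") for i in range(10)]

split = int(len(data) * 0.9)          # 90% train, remainder test
train, test = data[:split], data[split:]

def nearest_label(x):
    """Label of the training item closest to x (squared distance)."""
    return min(train, key=lambda item: sum((a - b) ** 2
                                           for a, b in zip(item[0], x)))[1]

correct = sum(nearest_label(x) == label for x, label in test)
print(correct / len(test))            # -> 1.0 on this separable toy data
```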
Visualizing errors
With the Lazy.IBk algorithm (unsupervised data setting, 90.0% train / remainder test split) there are 23 errors: instances classified under a different level than the one they should belong to (highlighted with a yellow mark in the error visualization).
6.4. Clustering:
The Clusterers.EM clustering algorithm is used for the SI data clustering demonstration, with ten-fold cross-validation and 10 bins using equal frequency; an estimate of its performance is also calculated.
Ignored attribute: GRADE
Test mode: Evaluate on training data
Association rule mining finds interesting association or correlation relationships among sets of data items. With massive amounts of SI data continuously being collected and stored in Excel, SNNPR EB is becoming interested in mining association rules from its databases. For example, the discovery of interesting association relationships among huge numbers of SI records can help with standards categorization, cross-school performance analysis, resource distribution, and other management decision-making processes.
Consider association rule mining for standards progress analysis. This process analyzes school performance patterns by finding associations between the different standards on which schools are rated every year. The discovery of such associations can help the EB develop strategies by giving insight into which standards frequently fall below the required level together at the same school. For instance, if a school falls below the threshold on one of the input standards, say standard-1 = L, how likely is it to also fall below on a process standard (and which standard)? Such information can lead to improvement in the schools' standards.
7. Conclusion
This seminar report provided an overview of Data Mining Process, its techniques and
applications. The following conclusions can be drawn:
I. Data Mining is a crucial step in the Knowledge Discovery in Databases Process but can
only be performed after pre-processing and transformation.
II. Although the basic steps in data mining include data cleaning, selection, and transformation, the functions and techniques are only applied in the vital step where intelligent methods are used to detect patterns.
III. A model for Data Mining is useful for a company or a data mining practitioner as it
helps in adapting a result oriented approach.
IV. Cross Industry Standard Process for Data Mining Model is an effective approach to a
model which considers business requirements at every step.
V. Classification and clustering techniques are popular and easily applicable in data mining; however, classification requires prior characteristic (class) information.
VI. Artificial neural networks can be deployed to detect patterns and make predictions, which makes them capable tools in data mining. A feed-forward neural network uses a back-propagation algorithm to train itself.
VII. The application of data mining techniques along with GIS techniques makes for a
potential opportunity to explore various aspects of Spatial Data Mining.
VIII. The growth of data available for processing, as well as multimedia elements and the World Wide Web, leads to greater opportunities for data mining techniques. However, the pre-processing, selection, and transformation steps need to be handled first.
In this seminar I have used the WEKA framework for data mining on SNNPR EB SI data. I tested the dataset with different classification algorithms and compared their performance on our dataset. I found that Lazy.IBk, with a 90.0% train / remainder test split, is the best-performing algorithm for developing the application. I also found the 10 best association rules, which help with decision making and with gaining knowledge about the associations between standards.
8. References
[1] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8:866-883, 1996.
[9] Portia A. Cerny. Data mining and neural networks from a commercial perspective.
[10] Bharati M. Ramageri. Data mining techniques and applications; Yashpal Singh and Alok Singh Chauhan. Neural networks in data mining.