
University of Gondar

Faculty of Informatics
Department of Information Technology, PG Program
Year: 2nd
Assignment 2 for Data Warehousing and Data Mining
(IT 617)

Prepared by:
Tesfaye Ashebr Sec. 3
ID GUS/22461/10

Submitted to: Instructor Mengistu Belete (PhD candidate)

August 2011 E.C

Gondar, Ethiopia
ACKNOWLEDGEMENT

First of all, we would like to thank God, who helped us through all the ups and downs and enabled us to complete this assignment successfully, beyond what we had anticipated. Next, we would like to express our deepest gratitude to our instructor, Mengistu Belete, who contributed his ideas to make this assignment as good as we needed it to be.
What Is Overfitting?
Overfitting is a modeling error that occurs when a function fits a limited set of data points too closely. An overfit model generally takes the form of an overly complex model built to explain idiosyncrasies in the data under study.

In reality, the data being studied often contains some degree of error or random noise. Attempting to make the model conform too closely to slightly inaccurate data therefore infects the model with substantial error and reduces its predictive power.

For instance, a common problem is using computer algorithms to search extensive databases of historical market data for patterns. Given enough study, it is often possible to develop elaborate theories that appear to predict things such as stock-market returns with close accuracy.

However, when applied to data outside the sample, such theories may prove to be merely the overfitting of a model to what were in reality just chance occurrences. In all cases, it is important to test a model against data outside the sample used to develop it.

Key Takeaways

 Overfitting is a modeling error that occurs when a function fits a limited set of data points too closely.
 Financial professionals must always be aware of the dangers of overfitting a model to limited data.
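To make the idea concrete, here is a minimal sketch with hypothetical data: the true relationship is y = 2x observed with a little noise, and a model that simply memorizes the training points (1-nearest neighbour) fits them perfectly but generalizes far worse than a plain fitted line.

```python
# Hypothetical illustration: the true relationship is y = 2x, observed with a
# little noise. A memorizing model fits the training data perfectly but
# generalizes worse than a simple line -- that gap is overfitting.
train = list(zip([0.0, 1.0, 2.0, 3.0, 4.0],
                 [0.3, 1.8, 4.25, 5.7, 8.2]))       # noisy samples of y = 2x
test = [(x, 2 * x) for x in [0.5, 1.5, 2.5, 3.5]]   # noise-free held-out data

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

a, b = fit_line(train)

def line(x):
    return a * x + b

def memorizer(x):
    """Predict the y of the nearest training point (memorizes the noise)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict, points):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

print(mse(memorizer, train), mse(line, train))  # memorizer wins on training data
print(mse(memorizer, test), mse(line, test))    # but loses badly on unseen data
```

The memorizer's training error is exactly zero, yet its error on the held-out points is far larger than the line's, which is the overfitting pattern described above.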

Some relatively easy problems solved by data mining algorithms are:

Manual data entry

Inaccuracy and duplication of data are major business problems for an organization wanting to automate its processes. Machine learning (ML) and predictive modeling algorithms can significantly improve the situation. ML programs use the discovered data to improve the process as more calculations are made; thus machines can learn to perform time-intensive documentation and data entry tasks, and knowledge workers can spend more time on higher-value problem-solving tasks. Arria, an AI-based firm, has developed a natural language processing technology that scans texts and determines the relationships between concepts in order to write reports.

Detecting spam

Spam detection was the earliest problem solved by ML. A few years ago, email service providers used pre-existing rule-based techniques to remove spam, but now spam filters create new rules themselves using ML. Thanks to the brain-like "neural networks" in its spam filters, Google now boasts a spam rate of just 0.1 percent; these networks learn to recognize junk mail and phishing messages by analyzing rules across an enormous collection of computers. In addition to spam detection, social media websites use ML to identify and filter abuse.
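As a toy sketch of how a learned spam filter works (hypothetical messages, not any provider's actual system), a Naive Bayes classifier can learn per-class word frequencies from labelled mail and score new messages:

```python
import math
from collections import Counter

# Toy Naive Bayes spam filter on hypothetical labelled messages.
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("see you at the meeting", "ham"),
]

# Learn word counts per class and class frequencies from the training mail.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # → spam
```

Real spam filters use the same principle at vastly larger scale, with continuously updated rules rather than a fixed toy corpus.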

Product recommendation

Unsupervised learning enables a product-based recommendation system. Given a customer's purchase history and a large inventory of products, ML models can identify the products that customer will be interested in and likely to purchase. The algorithm identifies hidden patterns among items and focuses on grouping similar products into clusters. A model of this decision process allows a program to make recommendations to a customer and motivate product purchases. E-commerce businesses such as Amazon have this capability, and Facebook uses unsupervised learning together with location details to recommend other users to connect with.
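A minimal sketch of the idea, using hypothetical purchase histories: score candidate items by how much they co-occur with a customer's basket among similar customers (Jaccard overlap), which is a simple stand-in for the clustering-based systems described above.

```python
# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"novel", "bookmark"},
    "dave": {"novel", "bookmark", "lamp"},
}

def recommend(customer, k=2):
    """Recommend items bought by the most similar customers (Jaccard overlap)."""
    mine = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        sim = len(mine & items) / len(mine | items)  # Jaccard similarity
        for item in items - mine:                    # only items not yet owned
            scores[item] = scores.get(item, 0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bob"))  # "keyboard" ranks first: alice's basket overlaps bob's
```

Production systems replace the overlap score with learned clusters or matrix factorization, but the decision process is the same: similar histories drive the recommendation.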

Medical diagnosis

Machine learning in the medical field can improve patients' health at minimum cost. Use cases of ML include making near-perfect diagnoses, recommending the best medicines, predicting readmissions, and identifying high-risk patients. These predictions are based on datasets of anonymized patient records and the symptoms exhibited by a patient. Adoption of ML is happening at a rapid pace despite many hurdles, which can be overcome by practitioners and consultants who know the legal, technical, and medical obstacles.

Customer segmentation and lifetime value prediction

Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the main challenges faced by any marketer. Businesses have huge amounts of marketing-relevant data from various sources, such as email campaigns, website visitors and lead data. Using data mining and machine learning, accurate predictions for individual marketing offers and incentives can be achieved, and savvy marketers can eliminate the guesswork involved in data-driven marketing. For example, given a user's pattern of behavior during a trial period and the past behaviors of all users, the chances of conversion to the paid version can be predicted. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or engage better in the trial.
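The conversion-prediction step can be sketched very simply with hypothetical trial data: look up the empirical conversion rate of past users with a similar activity level, and use it as the predicted chance for a new trial user.

```python
# Hypothetical history: (sessions during trial, converted to paid version?).
history = [
    (1, False), (2, False), (3, False), (8, True),
    (9, True), (12, True), (2, False), (10, True),
]

def conversion_rate(min_sessions):
    """Fraction of past users at or above this activity level who converted."""
    cohort = [converted for sessions, converted in history
              if sessions >= min_sessions]
    return sum(cohort) / len(cohort)

# A new trial user with 9 sessions: estimate their conversion chance from
# similarly active past users, and trigger an intervention if it is low.
print(conversion_rate(8), conversion_rate(1))
```

A real system would replace this single-feature lookup with a classifier over many behavioral features, but the decision logic, predict conversion and intervene accordingly, is the same.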

Financial analysis

Due to the large volume of data, its quantitative nature and the availability of accurate historical records, machine learning can be used in financial analysis. Present use cases of ML in finance include algorithmic trading, portfolio management, fraud detection and loan underwriting. According to an Ernst & Young report on 'The future of underwriting', machine learning will enable continual assessment of data to detect and analyze anomalies and nuances, improving the precision of models and rules, and machines will replace a large number of underwriting positions. Future applications of ML in finance include chat-bots and conversational interfaces for customer service, security and sentiment analysis.
Predictive maintenance

The manufacturing industry can use artificial intelligence (AI) and ML to discover meaningful patterns in factory data. Corrective and preventive maintenance practices are costly and inefficient; for predictive maintenance, an ML architecture can be built consisting of historical device data, a flexible analysis environment, a workflow visualization tool and an operations feedback loop. The Azure ML platform provides an example of simulated aircraft engine run-to-failure events to demonstrate the predictive maintenance modeling process. The asset is assumed to have a progressing degradation pattern that is reflected in its sensor measurements, and in order to predict future failures, the ML algorithm learns the relationship between sensor values and their changes and historical failures.

Image recognition (computer vision)

Computer vision produces numerical or symbolic information from images and other high-dimensional data. It involves machine learning, data mining, knowledge discovery in databases and pattern recognition. Potential business uses of image recognition technology are found in healthcare, automobiles (driverless cars), marketing campaigns, and more.

Below are examples of more difficult machine learning problems that really ground what machine learning is all about:

 Spam Detection: Given the email messages in an inbox, identify those that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder. We should all be familiar with this example.
 Credit Card Fraud Detection: Given credit card transactions for a customer in a month,
identify those transactions that were made by the customer and those that were not. A program
with a model of this decision could refund those transactions that were fraudulent.
 Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten character. A model of this problem would allow a computer program to read and understand handwritten zip codes and sort envelopes by geographic region.
 Speech Understanding: Given an utterance from a user, identify the specific request
made by the user. A model of this problem would allow a program to understand and make an
attempt to fulfill that request.
 Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person. Some cameras and software such as iPhoto have this capability.

The classifier algorithm is trees.J48, Weka's implementation of the C4.5 decision tree learner.


No  Parameter                              Value
1   Batch size                             100
2   Binary splits                          True
3   Collapse tree                          True
4   Confidence factor                      0.25
5   Debug                                  True
6   Do not check capabilities              True
7   Do not make split point actual value   False
8   Minimum number of objects              100
9   Number of decimal places               2
10  Number of folds                        5
11  Reduced error pruning                  True
12  Save instance data                     False
13  Seed                                   1
14  Subtree raising                        True
15  Unpruned                               False
16  Use Laplace                            False
17  Use MDL correction                     True
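The same settings can be expressed on Weka's command line. This is a sketch, assuming weka.jar is on the classpath, and note that with reduced-error pruning (-R) enabled, J48 ignores the confidence factor:

```shell
# Sketch of the equivalent Weka invocation (assumes weka.jar on the classpath).
# -B: binary splits; -C: confidence factor (ignored when -R is set);
# -M: minimum number of objects per leaf; -R -N 5: reduced-error pruning with
# 5 folds; -Q: seed; -t: training file (Weka then performs 10-fold
# cross-validation by default).
java weka.classifiers.trees.J48 -B -C 0.25 -M 100 -R -N 5 -Q 1 -t bank.arff
```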

The resulting decision tree is shown by the tree visualizer in the following diagram.


Instances:  4521
Attributes: 17
Test mode:  10-fold cross-validation
Classifier model (full training set)
Time taken to build model: 0.48 seconds

Correctly Classified Instances      4026     89.0511 %
Incorrectly Classified Instances     495     10.9489 %
Kappa statistic                       0.2333
Mean absolute error                   0.1768
Root mean squared error               0.2994
Relative absolute error              86.641  %
Root relative squared error          93.7595 %
Total Number of Instances           4521
Detailed Accuracy by Class

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.983    0.820    0.902      0.983   0.941      0.281  0.729     0.941     no
               0.180    0.017    0.580      0.180   0.275      0.281  0.729     0.350     yes
Weighted Avg.  0.891    0.727    0.865      0.891   0.864      0.281  0.729     0.868

Confusion Matrix

    a     b   <-- classified as
 3932    68 | a = no
  427    94 | b = yes
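The headline figures above can be recomputed directly from the confusion matrix, which makes clear where the 89.05 % accuracy and the low recall on the "yes" class come from:

```python
# Recompute the headline metrics from the confusion matrix reported above.
tn, fp = 3932, 68   # actual "no":  3932 predicted no, 68 predicted yes
fn, tp = 427, 94    # actual "yes": 427 predicted no, 94 predicted yes

total = tn + fp + fn + tp                  # 4521 instances in all
accuracy = (tn + tp) / total               # fraction correctly classified
precision_yes = tp / (tp + fp)             # predicted "yes" that really were
recall_yes = tp / (tp + fn)                # actual "yes" that were found

print(round(accuracy, 4), round(precision_yes, 2), round(recall_yes, 2))
# → 0.8905 0.58 0.18
```

Despite the high overall accuracy, the model finds only 18 % of the "yes" cases, which matches the low kappa statistic (0.2333): most of the accuracy comes from the dominant "no" class.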

Based on the predefined thresholds and limits specified by the various parameters above, the tolerance for test-node replacement proved to be a worthy candidate for experimentation. Weka provides a number of sample domains, each varying in the number of instances, classes, and attributes. We used the bank.arff data set in our experiments; it contains 17 attributes over 4521 instances (of which 4026 were classified correctly) with 2 possible class labels. To assess the performance of both pruning methods on the sample domains, a series of trials with both training and testing sets was run through a decision tree created by the J48 algorithm, and the classification accuracy was recorded.

The result on the above (banking) dataset is good because the correctly classified instances greatly outnumber the incorrectly classified ones. In our testing, we found that online pruning is useful for reducing the size of the decision tree but always imparts a penalty on accuracy. On the other hand, lowering the confidence in the training data (the confidence factor) can not only reduce the tree size but also help filter out statistically irrelevant nodes that would otherwise lead to classification errors. We conclude that several values for the confidence factor should be tested when generating decision trees in order to find the most appropriate value for the particular training set under examination.
Bonus

In this assignment, we empirically compare the performance of neural nets and decision trees on a data set for the detection of defects in bank data. The data set was created by image feature extraction procedures working on bank data, and we consider it highly complex, containing imprecise and uncertain data. We explain how the data set was created and what kinds of features were extracted from the images. Then we explain what kinds of neural nets and decision tree induction were used for classification, and we introduce a framework for distinguishing classification methods. We observed that the performance of neural nets is significantly better than that of decision trees if we look only at the overall error rate, but a more detailed analysis of the error rate is necessary in order to judge the performance of a learning and classification method. The error rate cannot be the only criterion for comparing the different learning methods; selection is a more complex process involving further criteria, which we describe in the following.

The parameters used for the bank data:

Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: bank_data
Instances: 4521
Attributes: 17
Test mode: 10-fold cross-validation


Using the above parameters, the trained network can be visualized in the following way.
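Among the flags above, -L 0.3 sets the learning rate, -M 0.2 the momentum, -N 500 the number of training epochs, -S 0 the random seed, and -H a one automatically sized hidden layer. A stripped-down sketch of the same training loop, on toy AND data with a hypothetical two-unit hidden layer (an illustration of backpropagation with momentum, not Weka's implementation):

```python
import math
import random

# Toy multilayer perceptron trained with the same learning rate (-L 0.3),
# momentum (-M 0.2) and epoch count (-N 500) as the Weka scheme above.
random.seed(0)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # AND function

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Two hidden units and one output unit, each with two weights plus a bias.
w_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(3)]
v_h = [[0.0] * 3 for _ in range(2)]  # momentum terms, hidden layer
v_o = [0.0] * 3                      # momentum terms, output unit
lr, mom = 0.3, 0.2                   # -L 0.3, -M 0.2

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    return h, sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])

def mean_loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

loss_before = mean_loss()
for _ in range(500):                                  # -N 500 epochs
    for x, y in data:
        h, out = forward(x)
        d_out = (out - y) * out * (1 - out)           # output-unit delta
        for j in range(2):                            # backpropagate to hidden
            d_h = d_out * w_o[j] * h[j] * (1 - h[j])
            for i in range(3):
                grad = d_h * (x[i] if i < 2 else 1.0)
                v_h[j][i] = mom * v_h[j][i] - lr * grad
                w_h[j][i] += v_h[j][i]
        for j in range(3):                            # update output weights
            grad = d_out * (h[j] if j < 2 else 1.0)
            v_o[j] = mom * v_o[j] - lr * grad
            w_o[j] += v_o[j]
loss_after = mean_loss()
print(loss_before, loss_after)
```

After 500 epochs the mean squared error drops well below its starting value, mirroring on a toy scale what Weka's MultilayerPerceptron does on the 17-attribute bank data.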
