
University of Gondar

Faculty of Informatics
Department of Information Technology, PG Program
Year: 2nd
Assignment 2 for Data Warehousing and Data Mining
(IT 617)

Prepared by:
Tesfaye Ashebr Sec. 3
ID GUS/22461/10

Submitted to: Instructor Mengistu Belete (PhD candidate)

August 2011 E.C

Gondar, Ethiopia
ACKNOWLEDGEMENT

First of all, we would like to thank God, who helped us through all the ups and downs and enabled us to complete this assignment successfully, beyond what we had anticipated. Next, we would like to express our deepest gratitude to our instructor, Mengistu Belete, who contributed his ideas to make this assignment as good as we needed it to be.
What Is Overfitting?
Overfitting is a modeling error that occurs when a function fits a limited set of data points too closely. An overfit model generally takes the form of an overly complex model built to explain idiosyncrasies in the data under study.

In reality, the data being studied often contains some degree of error or random noise. Attempting to make the model conform too closely to slightly inaccurate data therefore infects the model with substantial error and reduces its predictive power.

For instance, a common problem is using computer algorithms to search extensive databases of historical market data for patterns. Given enough study, it is often possible to develop elaborate theories that appear to predict things such as stock-market returns with close accuracy.

However, when applied to data outside the sample, such theories may prove to be merely the overfitting of a model to what were in reality just chance occurrences. In all cases, it is important to test a model against data outside the sample used to develop it.

Key Takeaways

 Overfitting is a modeling error that occurs when a function fits a limited set of data points too closely.
 Financial professionals must always be aware of the dangers of overfitting a model to limited data.
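To make the idea concrete, here is a minimal sketch with hypothetical data: the true relationship is y = 2x observed with a little noise, and a model that simply memorizes the training points (1-nearest neighbour) fits them perfectly but generalizes far worse than a plain fitted line.

```python
# Hypothetical illustration: the true relationship is y = 2x, observed with a
# little noise. A memorizing model fits the training data perfectly but
# generalizes worse than a simple line -- that gap is overfitting.
train = list(zip([0.0, 1.0, 2.0, 3.0, 4.0],
                 [0.3, 1.8, 4.25, 5.7, 8.2]))       # noisy samples of y = 2x
test = [(x, 2 * x) for x in [0.5, 1.5, 2.5, 3.5]]   # noise-free held-out data

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

a, b = fit_line(train)

def line(x):
    return a * x + b

def memorizer(x):
    """Predict the y of the nearest training point (memorizes the noise)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict, points):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

print(mse(memorizer, train), mse(line, train))  # memorizer wins on training data
print(mse(memorizer, test), mse(line, test))    # but loses badly on unseen data
```

The memorizer's training error is exactly zero, yet its error on the held-out points is far larger than the line's, which is the overfitting pattern described above.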

Some relatively easy problems solved by data mining algorithms are:

Manual data entry

Inaccuracy and duplication of data are major business problems for an organization wanting to automate its processes. Machine learning (ML) and predictive modeling algorithms can significantly improve the situation. ML programs use the discovered data to improve the process as more calculations are made; thus machines can learn to perform time-intensive documentation and data entry tasks, and knowledge workers can spend more time on higher-value problem-solving tasks. Arria, an AI-based firm, has developed a natural language processing technology that scans texts and determines the relationships between concepts in order to write reports.

Detecting spam

Spam detection was the earliest problem solved by ML. A few years ago, email service providers used pre-existing rule-based techniques to remove spam, but now spam filters create new rules themselves using ML. Thanks to the brain-like "neural networks" in its spam filters, Google now boasts a spam rate of just 0.1 percent; these networks learn to recognize junk mail and phishing messages by analyzing rules across an enormous collection of computers. In addition to spam detection, social media websites use ML to identify and filter abuse.
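As a toy sketch of how a learned spam filter works (hypothetical messages, not any provider's actual system), a Naive Bayes classifier can learn per-class word frequencies from labelled mail and score new messages:

```python
import math
from collections import Counter

# Toy Naive Bayes spam filter on hypothetical labelled messages.
train = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("see you at the meeting", "ham"),
]

# Learn word counts per class and class frequencies from the training mail.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # → spam
```

Real spam filters use the same principle at vastly larger scale, with continuously updated rules rather than a fixed toy corpus.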

Product recommendation

Unsupervised learning enables a product-based recommendation system. Given a customer's purchase history and a large inventory of products, ML models can identify the products that customer will be interested in and likely to purchase. The algorithm identifies hidden patterns among items and focuses on grouping similar products into clusters. A model of this decision process allows a program to make recommendations to a customer and motivate product purchases. E-commerce businesses such as Amazon have this capability, and Facebook uses unsupervised learning together with location details to recommend other users to connect with.
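A minimal sketch of the idea, using hypothetical purchase histories: score candidate items by how much they co-occur with a customer's basket among similar customers (Jaccard overlap), which is a simple stand-in for the clustering-based systems described above.

```python
# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"novel", "bookmark"},
    "dave": {"novel", "bookmark", "lamp"},
}

def recommend(customer, k=2):
    """Recommend items bought by the most similar customers (Jaccard overlap)."""
    mine = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        sim = len(mine & items) / len(mine | items)  # Jaccard similarity
        for item in items - mine:                    # only items not yet owned
            scores[item] = scores.get(item, 0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bob"))  # "keyboard" ranks first: alice's basket overlaps bob's
```

Production systems replace the overlap score with learned clusters or matrix factorization, but the decision process is the same: similar histories drive the recommendation.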

Medical diagnosis

Machine learning in the medical field can improve patients' health at minimum cost. Use cases of ML include making near-perfect diagnoses, recommending the best medicines, predicting readmissions, and identifying high-risk patients. These predictions are based on datasets of anonymized patient records and the symptoms exhibited by a patient. Adoption of ML is happening at a rapid pace despite many hurdles, which can be overcome by practitioners and consultants who know the legal, technical, and medical obstacles.

Customer segmentation and lifetime value prediction

Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the main challenges faced by any marketer. Businesses have huge amounts of marketing-relevant data from various sources, such as email campaigns, website visitors and lead data. Using data mining and machine learning, accurate predictions for individual marketing offers and incentives can be achieved, and savvy marketers can eliminate the guesswork involved in data-driven marketing. For example, given a user's pattern of behavior during a trial period and the past behaviors of all users, the chances of conversion to the paid version can be predicted. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or engage better in the trial.
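The conversion-prediction step can be sketched very simply with hypothetical trial data: look up the empirical conversion rate of past users with a similar activity level, and use it as the predicted chance for a new trial user.

```python
# Hypothetical history: (sessions during trial, converted to paid version?).
history = [
    (1, False), (2, False), (3, False), (8, True),
    (9, True), (12, True), (2, False), (10, True),
]

def conversion_rate(min_sessions):
    """Fraction of past users at or above this activity level who converted."""
    cohort = [converted for sessions, converted in history
              if sessions >= min_sessions]
    return sum(cohort) / len(cohort)

# A new trial user with 9 sessions: estimate their conversion chance from
# similarly active past users, and trigger an intervention if it is low.
print(conversion_rate(8), conversion_rate(1))
```

A real system would replace this single-feature lookup with a classifier over many behavioral features, but the decision logic, predict conversion and intervene accordingly, is the same.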

Financial analysis

Due to the large volume of data, its quantitative nature and the availability of accurate historical records, machine learning can be used in financial analysis. Present use cases of ML in finance include algorithmic trading, portfolio management, fraud detection and loan underwriting. According to an Ernst & Young report on 'The future of underwriting', machine learning will enable continual assessment of data to detect and analyze anomalies and nuances, improving the precision of models and rules, and machines will replace a large number of underwriting positions. Future applications of ML in finance include chat-bots and conversational interfaces for customer service, security and sentiment analysis.
Predictive maintenance

The manufacturing industry can use artificial intelligence (AI) and ML to discover meaningful patterns in factory data. Corrective and preventive maintenance practices are costly and inefficient; for predictive maintenance, an ML architecture can be built consisting of historical device data, a flexible analysis environment, a workflow visualization tool and an operations feedback loop. The Azure ML platform provides an example of simulated aircraft engine run-to-failure events to demonstrate the predictive maintenance modeling process. The asset is assumed to have a progressing degradation pattern that is reflected in its sensor measurements, and in order to predict future failures, the ML algorithm learns the relationship between sensor values and their changes and historical failures.

Image recognition (computer vision)

Computer vision produces numerical or symbolic information from images and other high-dimensional data. It involves machine learning, data mining, knowledge discovery in databases and pattern recognition. Potential business uses of image recognition technology are found in healthcare, automobiles (driverless cars), marketing campaigns, and more.

Below are examples of more difficult machine learning problems that really ground what machine learning is all about:

 Spam Detection: Given the email messages in an inbox, identify those that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder. We should all be familiar with this example.
 Credit Card Fraud Detection: Given credit card transactions for a customer in a month,
identify those transactions that were made by the customer and those that were not. A program
with a model of this decision could refund those transactions that were fraudulent.
 Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten character. A model of this problem would allow a computer program to read and understand handwritten zip codes and sort envelopes by geographic region.
 Speech Understanding: Given an utterance from a user, identify the specific request
made by the user. A model of this problem would allow a program to understand and make an
attempt to fulfill that request.
 Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person. Some cameras and software such as iPhoto have this capability.

The classifier algorithm is trees.J48, Weka's implementation of the C4.5 decision tree learner.


No  Parameter                              Value
1   Batch size                             100
2   Binary splits                          True
3   Collapse tree                          True
4   Confidence factor                      0.25
5   Debug                                  True
6   Do not check capabilities              True
7   Do not make split point actual value   False
8   Minimum number of objects              100
9   Number of decimal places               2
10  Number of folds                        5
11  Reduced error pruning                  True
12  Save instance data                     False
13  Seed                                   1
14  Subtree raising                        True
15  Unpruned                               False
16  Use Laplace                            False
17  Use MDL correction                     True
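The same settings can be expressed on Weka's command line. This is a sketch, assuming weka.jar is on the classpath, and note that with reduced-error pruning (-R) enabled, J48 ignores the confidence factor:

```shell
# Sketch of the equivalent Weka invocation (assumes weka.jar on the classpath).
# -B: binary splits; -C: confidence factor (ignored when -R is set);
# -M: minimum number of objects per leaf; -R -N 5: reduced-error pruning with
# 5 folds; -Q: seed; -t: training file (Weka then performs 10-fold
# cross-validation by default).
java weka.classifiers.trees.J48 -B -C 0.25 -M 100 -R -N 5 -Q 1 -t bank.arff
```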

The resulting decision tree is shown by the tree visualizer in the following diagram.


Instances:  4521
Attributes: 17
Test mode:  10-fold cross-validation
Classifier model (full training set)
Time taken to build model: 0.48 seconds

Correctly Classified Instances      4026     89.0511 %
Incorrectly Classified Instances     495     10.9489 %
Kappa statistic                       0.2333
Mean absolute error                   0.1768
Root mean squared error               0.2994
Relative absolute error              86.641  %
Root relative squared error          93.7595 %
Total Number of Instances           4521
Detailed Accuracy by Class

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.983    0.820    0.902      0.983   0.941      0.281  0.729     0.941     no
               0.180    0.017    0.580      0.180   0.275      0.281  0.729     0.350     yes
Weighted Avg.  0.891    0.727    0.865      0.891   0.864      0.281  0.729     0.868

Confusion Matrix

    a     b   <-- classified as
 3932    68 | a = no
  427    94 | b = yes
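The headline figures above can be recomputed directly from the confusion matrix, which makes clear where the 89.05 % accuracy and the low recall on the "yes" class come from:

```python
# Recompute the headline metrics from the confusion matrix reported above.
tn, fp = 3932, 68   # actual "no":  3932 predicted no, 68 predicted yes
fn, tp = 427, 94    # actual "yes": 427 predicted no, 94 predicted yes

total = tn + fp + fn + tp                  # 4521 instances in all
accuracy = (tn + tp) / total               # fraction correctly classified
precision_yes = tp / (tp + fp)             # predicted "yes" that really were
recall_yes = tp / (tp + fn)                # actual "yes" that were found

print(round(accuracy, 4), round(precision_yes, 2), round(recall_yes, 2))
# → 0.8905 0.58 0.18
```

Despite the high overall accuracy, the model finds only 18 % of the "yes" cases, which matches the low kappa statistic (0.2333): most of the accuracy comes from the dominant "no" class.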

Based on the predefined thresholds and limits specified by the various parameters above, the tolerance for test-node replacement proved to be a worthy candidate for experimentation. Weka provides a number of sample domains, each varying in the number of instances, classes, and attributes. We used the bank.arff data set in our experiments; it contains 17 attributes over 4521 instances (of which 4026 were classified correctly) with 2 possible class labels. To assess the performance of both pruning methods on the sample domains, a series of trials with both training and testing sets was run through a decision tree created by the J48 algorithm, and the classification accuracy was recorded.

The result on the above (banking) dataset is good because the correctly classified instances greatly outnumber the incorrectly classified ones. In our testing, we found that online pruning is useful for reducing the size of the decision tree but always imparts a penalty on accuracy. On the other hand, lowering the confidence in the training data (the confidence factor) can not only reduce the tree size but also help filter out statistically irrelevant nodes that would otherwise lead to classification errors. We conclude that several values for the confidence factor should be tested when generating decision trees in order to find the most appropriate value for the particular training set under examination.
Bonus

In this assignment, we empirically compare the performance of neural nets and decision trees on a data set for the detection of defects in bank data. The data set was created by image feature extraction procedures working on bank data, and we consider it highly complex, containing imprecise and uncertain data. We explain how the data set was created and what kinds of features were extracted from the images. Then we explain what kinds of neural nets and decision tree induction were used for classification, and we introduce a framework for distinguishing classification methods. We observed that the performance of neural nets is significantly better than that of decision trees if we look only at the overall error rate, but a more detailed analysis of the error rate is necessary in order to judge the performance of a learning and classification method. The error rate cannot be the only criterion for comparing the different learning methods; selection is a more complex process involving further criteria, which we describe in the following.

The parameters used for the bank data:

Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: bank_data
Instances: 4521
Attributes: 17
Test mode: 10-fold cross-validation


Using the above parameters, the trained network can be visualized in the following way.
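Among the flags above, -L 0.3 sets the learning rate, -M 0.2 the momentum, -N 500 the number of training epochs, -S 0 the random seed, and -H a one automatically sized hidden layer. A stripped-down sketch of the same training loop, on toy AND data with a hypothetical two-unit hidden layer (an illustration of backpropagation with momentum, not Weka's implementation):

```python
import math
import random

# Toy multilayer perceptron trained with the same learning rate (-L 0.3),
# momentum (-M 0.2) and epoch count (-N 500) as the Weka scheme above.
random.seed(0)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # AND function

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Two hidden units and one output unit, each with two weights plus a bias.
w_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(3)]
v_h = [[0.0] * 3 for _ in range(2)]  # momentum terms, hidden layer
v_o = [0.0] * 3                      # momentum terms, output unit
lr, mom = 0.3, 0.2                   # -L 0.3, -M 0.2

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    return h, sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])

def mean_loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

loss_before = mean_loss()
for _ in range(500):                                  # -N 500 epochs
    for x, y in data:
        h, out = forward(x)
        d_out = (out - y) * out * (1 - out)           # output-unit delta
        for j in range(2):                            # backpropagate to hidden
            d_h = d_out * w_o[j] * h[j] * (1 - h[j])
            for i in range(3):
                grad = d_h * (x[i] if i < 2 else 1.0)
                v_h[j][i] = mom * v_h[j][i] - lr * grad
                w_h[j][i] += v_h[j][i]
        for j in range(3):                            # update output weights
            grad = d_out * (h[j] if j < 2 else 1.0)
            v_o[j] = mom * v_o[j] - lr * grad
            w_o[j] += v_o[j]
loss_after = mean_loss()
print(loss_before, loss_after)
```

After 500 epochs the mean squared error drops well below its starting value, mirroring on a toy scale what Weka's MultilayerPerceptron does on the 17-attribute bank data.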
