Faculty of Informatics
Department of Information Technology for PG program
Year-2nd
Assignment 2 for data warehousing and data mining
(IT 617)
Set by:
Tesfaye Ashebr Sec. 3
ID GUS/22461/10
Gondar, Ethiopia
ACKNOWLEDGEMENT
First of all, we would like to thank GOD, who helped us through all the ups and downs and enabled
us to complete this assignment more successfully than we had anticipated. Next, we would like to
express our deepest gratitude to our instructor, Mengistu Belete, and to all who contributed their
ideas to make this assignment as good as it could be.
What Is Over-fitting?
Over-fitting is a modeling error that occurs when a function is fit too closely to a limited set of
data points. It generally takes the form of an overly complex model built to explain
idiosyncrasies in the data under study.
In reality, the data being studied often contains some degree of error or random noise. Attempting
to make the model conform too closely to slightly inaccurate data can therefore infect the model
with substantial errors and reduce its predictive power.
A model may appear to explain its sample well. However, when applied to data outside of the sample,
such a fit may prove to be merely the over-fitting of the model to what were in reality just chance
occurrences. In all cases, it is important to test a model against data outside of the sample used
to develop it.
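The point about testing on out-of-sample data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic data (points in [0, 1] labelled by whether they exceed 0.5, with 20% of labels flipped as noise), the two models and the helper names. It compares a simple one-threshold model with an overly complex one that memorizes every training point.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1]; the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulated random noise in the sample
        data.append((x, label))
    return data

def accuracy(predict, data):
    """Fraction of (x, label) pairs that `predict` gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)

def fit_threshold(train):
    """Simple model: pick the single cut-off that classifies the training set best."""
    candidates = [i / 20 for i in range(21)]
    return max(candidates, key=lambda t: accuracy(lambda x: x > t, train))

def fit_memorizer(train):
    """Overly complex model: answer with the label of the nearest training point."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

rng = random.Random(0)          # fixed seed so runs are repeatable
train = make_data(200, rng)
held_out = make_data(200, rng)  # data outside the sample used to build the models

t = fit_threshold(train)
simple = lambda x: x > t
memorizer = fit_memorizer(train)

for name, model in [("threshold", simple), ("memorizer", memorizer)]:
    print(name, "train:", accuracy(model, train), "held-out:", accuracy(model, held_out))
```

The memorizer reproduces every training label, noise included, so its training accuracy is perfect by construction. The held-out figure is the one that indicates predictive power, and for a memorizing model it is typically well below the training figure.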
Financial professionals must always be aware of the dangers of over-fitting a model
based on limited data.
Inaccuracy and duplication of data are major business problems for an organization wanting to
automate its processes. Machine learning (ML) algorithms and predictive modeling algorithms
can significantly improve the situation. ML programs use the discovered data to improve the
process as more calculations are made. Thus machines can learn to perform time-intensive
documentation and data entry tasks, and knowledge workers can spend more time on
higher-value problem-solving tasks. Aria, an AI-based firm, has developed a natural language
processing technology which scans texts and determines the relationships between concepts in
order to write reports.
Detecting spam
Spam detection is one of the earliest problems solved by ML. Four years ago, email service
providers used pre-existing rule-based techniques to remove spam. Now spam filters create new
rules themselves using ML. Thanks to "neural networks" in its spam filters, Google now boasts a
spam rate of 0.1 percent. These brain-like neural networks can learn to
recognize junk mail and phishing messages by analyzing rules across an enormous collection of
computers. In addition to spam detection, social media websites are using ML as a way to
Product recommendation
Given a purchase history for a customer and a large inventory of products, ML models can identify
those products in which that customer will be interested and which they are likely to purchase. The
algorithm identifies hidden patterns among items and focuses on grouping similar products into
clusters. A model of this decision process would allow a program to make recommendations to a
customer and motivate product purchases. E-commerce businesses such as Amazon have this
capability. Unsupervised learning, along with location detail, is used by Facebook to recommend
that users connect with other users.
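A minimal sketch of the idea follows. It does not use clustering; it swaps in a simpler technique, counting how often products are bought together, and it is not a description of how Amazon or Facebook work. The customer names, products and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: customer -> set of product names.
purchases = {
    "ana":   {"laptop", "mouse", "laptop bag"},
    "ben":   {"laptop", "mouse"},
    "chloe": {"laptop", "laptop bag"},
    "dawit": {"novel", "bookmark"},
    "eve":   {"laptop"},
}

def co_occurrence(histories):
    """Count how many customers bought each ordered pair of products together."""
    counts = Counter()
    for basket in histories.values():
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(customer, histories, top_n=2):
    """Rank products the customer does not own by co-purchase count with owned ones."""
    counts = co_occurrence(histories)
    owned = histories[customer]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by score, then by name, so ties are broken deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]

print(recommend("eve", purchases))    # ['laptop bag', 'mouse']
print(recommend("dawit", purchases))  # []
```

Co-occurrence counting was chosen because it fits in a few lines. A clustering approach of the kind the text describes would group similar products first and recommend from within the customer's clusters.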
Medical diagnosis
Machine learning in the medical field can improve patients' health at minimum cost. Use
cases of ML include making near-perfect diagnoses, recommending the best medicines, predicting
readmissions and identifying high-risk patients. These predictions are based on datasets of
anonymized patient
despite many hurdles, which can be overcome by practitioners and consultants who know the
Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the
main challenges faced by any marketer. Businesses have a huge amount of marketing-relevant
data from various sources such as email campaigns, website visitors and lead data. Using data
mining and machine learning, accurate predictions for individual marketing offers and
incentives can be achieved. Using ML, savvy marketers can eliminate the guesswork involved in
data-driven marketing. For example, given the pattern of behavior of a user during a trial period
and the past behavior of all users, the chances of conversion to the paid version can be
predicted. A model of this decision problem would allow a program to trigger customer
interventions to persuade the customer to convert early or to engage better in the trial.
Financial analysis
Because of its large data volumes, quantitative nature and accurate historical records, financial
analysis is well suited to machine learning. Present use cases of ML in finance include
algorithmic trading, portfolio management, fraud detection and loan underwriting. According to the
Ernst and Young report 'The future of underwriting', machine learning will enable continual
assessments of data for detection and analysis of anomalies and nuances to improve the precision
of models and rules, and machines will replace a large number of underwriting positions. Future
applications of ML in finance include chat-bots and conversational interfaces for customer
service, security and sentiment analysis.
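The detection of anomalies mentioned above can be sketched with a simple statistical baseline. This is not a trained ML model and not the method of any report cited here: it flags amounts whose modified z-score, computed from the median absolute deviation, exceeds a cut-off. The transaction amounts, the function name and the conventional constants 0.6745 and 3.5 are assumptions.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return indices whose modified z-score (MAD-based) exceeds `threshold`."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical card transactions in one month (arbitrary currency units).
amounts = [42.0, 38.5, 51.0, 45.2, 40.0, 980.0, 47.3, 39.9]
print(flag_anomalies(amounts))  # [5]
```

The median is used instead of the mean so that a single extreme amount does not distort the baseline it is being compared against.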
Predictive maintenance
The manufacturing industry can use artificial intelligence (AI) and ML to discover meaningful
patterns in factory data. Corrective and preventive maintenance practices are costly and
inefficient. For predictive maintenance, an ML architecture can be built which consists of
historical device data, a flexible analysis environment, a workflow visualization tool and operations
failure events to demonstrate the predictive maintenance modeling process. The asset is assumed
measurement. In order to predict future failures, ML algorithm learns the relationship between
Computer vision produces numerical or symbolic information from images and high-dimensional
data. It involves machine learning, data mining, database knowledge discovery and pattern
recognition. Potential business uses of image recognition technology are found in healthcare,
Below are examples of machine learning problems that really ground what machine learning is all
about.
Spam Detection: Given email in an inbox, identify those email messages that are spam
and those that are not. Having a model of this problem would allow a program to leave non-spam
emails in the inbox and move spam emails to a spam folder. We should all be familiar with this
example.
Credit Card Fraud Detection: Given credit card transactions for a customer in a month,
identify those transactions that were made by the customer and those that were not. A program
with a model of this decision could refund those transactions that were fraudulent.
Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for
each handwritten character. A model of this problem would allow a computer program to read
and understand handwritten zip codes and sort envelopes by geographic region.
Speech Understanding: Given an utterance from a user, identify the specific request
made by the user. A model of this problem would allow a program to understand and make an
attempt to fulfill that request.
Face Detection: Given a digital photo album of many hundreds of digital photographs,
identify those photos that include a given person. A model of this decision process would allow a
program to organize photos by person. Some cameras and software like iPhoto have this
capability.
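The spam detection example from the list above can be sketched as a tiny naive Bayes text classifier. The training messages, labels and function names are hypothetical, and real spam filters are far more elaborate; the sketch only shows how a model learned from labelled email can sort new messages.

```python
import math
from collections import Counter

# Hypothetical labelled messages, for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def train_naive_bayes(examples):
    """Collect per-class word counts and class frequencies."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the class with the highest log-probability under naive Bayes."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)  # class prior
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                p = (word_counts[label][word] + 1) / (n_words + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(training)
print(classify("claim your free prize", model))        # spam
print(classify("agenda for the team meeting", model))  # ham
```

A program holding such a model could leave messages classified as ham in the inbox and move those classified as spam to a spam folder.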
Confusion Matrix
    a     b   <-- classified as
 3932    68 |  a = no
  427    94 |  b = yes
Based on the predefined thresholds and limits specified by the various parameters, the
tolerance for test node replacement proved to be a worthy candidate for experimentation. Weka
provides a number of sample domains, each varying in the number of instances, classes
and attributes. We used the bank.arff data set in our experiments. This data set contains 17
attributes and 4521 instances with 2 possible class labels (yes and no); J48 classified 4026 of
the instances correctly. To assess the performance of both pruning methods on the sample domains,
a series of trials with both training and testing sets was sent through a decision tree created
by the J48 algorithm, and the classification accuracy was recorded.
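As a worked check, the classification accuracy can be recomputed from the four counts in the confusion matrix above. The sketch assumes the usual Weka layout (rows are actual classes, columns are predicted classes); the variable names are our own.

```python
# Counts copied from the confusion matrix above: (actual, predicted) -> count.
matrix = {
    ("no", "no"): 3932, ("no", "yes"): 68,
    ("yes", "no"): 427, ("yes", "yes"): 94,
}

total = sum(matrix.values())
correct = matrix[("no", "no")] + matrix[("yes", "yes")]
accuracy = correct / total

# Metrics for the minority class "yes".
true_yes = matrix[("yes", "yes")]
precision_yes = true_yes / (true_yes + matrix[("no", "yes")])
recall_yes = true_yes / (true_yes + matrix[("yes", "no")])

print(f"instances={total} correct={correct} accuracy={accuracy:.4f}")
print(f"precision(yes)={precision_yes:.4f} recall(yes)={recall_yes:.4f}")
```

Overall accuracy comes out at about 89%, while recall on the yes class is only about 18%, so a per-class view gives a more complete picture than the overall rate alone.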
The result on the banking data set is good in that the correctly classified instances (4026) far
outnumber the incorrectly classified instances (495). In our testing, we found that online
pruning is useful for reducing the size of the decision tree, but always imparts a penalty on
accuracy. On the other hand, lowering the confidence in the training data (the confidence
factor) not only reduces the tree size but also helps filter out statistically irrelevant
nodes that would otherwise lead to classification errors. We conclude that several values for the
confidence factor should be tested when generating decision trees to find the most appropriate
value for the particular training set under examination.
Bonus
In this assignment, we empirically compare the performance of neural nets and decision trees
based on a data set for the detection of defects in bank data. The data set was created by image
feature extraction procedures working on bank data. We consider our data set to be highly complex
and to contain imprecise and uncertain data. We explain how the data set was created and
what kinds of features were extracted from the images. Then, we explain what kinds of neural
nets and decision tree induction were used for classification. We introduce a framework for
distinguishing classification methods. We observed that the performance of neural nets is
significantly better than the performance of decision trees if we look only at the overall
error rate. We found that a more detailed analysis of the error rate is necessary in order to
judge the performance of the learning and classification method. However, the error rate cannot
be the only criterion for the comparison between the different learning methods. It is a more
complex selection process that involves more criteria, which we describe in the following.
Relation: bank_data
Instances: 4521
Attributes: 17