
Decision Trees

Presentation no.1

1 Submitted to: Dr. Muhammad Naeem


Presented by: Nosheen Fayyaz
2 Contents

 Definitions
 Why should we use Decision Trees
 The basic algorithm of Decision Trees (overview)
 Common steps for using Decision Trees
 Disadvantages
 Application of Decision Trees in NLP
3 Decision Trees

 Definition:
 The decision tree method is a powerful statistical tool for
classification, prediction, interpretation, and data
manipulation that has several potential applications.
 Non-parametric approach without distributional assumptions.
 A decision tree can also be re-represented as if-then rules to improve
human readability (a brief sketch follows)
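
A minimal sketch of this re-representation, assuming scikit-learn is available (the bundled iris data stands in for real data):

    # Minimal sketch: export a fitted decision tree as nested if-then rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # export_text prints the tree as if-then rules over feature thresholds.
    print(export_text(clf, feature_names=list(iris.feature_names)))
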
4 Why should we use Decision Trees?

• Simplifies complex relationships between input variables and target variables by dividing the original input variables into significant subgroups.
• Easy to understand and interpret.
• Easy to handle missing values without needing to resort to imputation.
• Easy to handle heavily skewed data without needing to resort to data transformation.
• Robust to outliers.
5 Why should we use Decision Trees?

 Decision trees can be used when

  Instances can be described by attribute-value pairs
  The target function is discrete-valued (classification)
  A disjunctive hypothesis may be required (decision trees search a
completely expressive hypothesis space)
  Inductive inference is needed
  The training data may be noisy
6 The Basic DTL Algorithm (overview)

Top-down, greedy search through the space of possible decision trees (ID3 and C4.5)

In the construction process, the key step is selecting the attribute used to split the example set into different classes.
 Root: the best attribute for classification, chosen according to its information gain (a rough sketch of this construction follows)
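
A rough, self-contained Python sketch of this ID3-style construction (the attribute/example representation and the tiny weather-style data are invented for illustration):

    # Rough ID3-style sketch of top-down, greedy tree construction.
    # Examples are dicts mapping attribute -> value; `target` names the class attribute.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, attr, target):
        labels = [ex[target] for ex in examples]
        gain = entropy(labels)
        for value, count in Counter(ex[attr] for ex in examples).items():
            subset = [ex[target] for ex in examples if ex[attr] == value]
            gain -= (count / len(examples)) * entropy(subset)
        return gain

    def build_tree(examples, attributes, target):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:                  # pure node -> leaf
            return labels[0]
        if not attributes:                         # no attributes left -> majority leaf
            return Counter(labels).most_common(1)[0][0]
        # Greedy step: the attribute with the highest information gain becomes the split.
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        tree = {best: {}}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            tree[best][value] = build_tree(subset, [a for a in attributes if a != best], target)
        return tree

    # Tiny illustrative run on invented data.
    data = [
        {"outlook": "sunny", "windy": "no",  "play": "yes"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rain",  "windy": "no",  "play": "yes"},
    ]
    print(build_tree(data, ["outlook", "windy"], "play"))  # e.g. {'windy': {'no': 'yes', 'yes': 'no'}}
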
7 The Basic DTL Algorithm (overview)

Information gain
 A quantitative measure of the worth of an attribute: it measures the expected reduction in entropy given the value of some attribute A

    Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

 Values(A): the set of all possible values for attribute A
 Sv: the subset of S for which attribute A has value v
8 The Basic DTL Algorithm (overview)
Entropy
Entropy(S) - p+ log2 p+ - p- log2 p-

p+(-) = proportion of positive (negative)


examples

 Entropy specifies the minimum number


of bits of information needed to encode
the classification of an arbitrary
member of S
 In general:
 Entropy(S) = -  i=1,c pi log2 pi
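
As a quick numeric illustration of the formula (the counts are hypothetical: a set with 9 positive and 5 negative examples):

    # Quick numeric check of the entropy formula for a hypothetical set S
    # containing 9 positive and 5 negative examples.
    from math import log2

    p_pos, p_neg = 9 / 14, 5 / 14
    entropy_S = -p_pos * log2(p_pos) - p_neg * log2(p_neg)
    print(round(entropy_S, 3))   # ~0.94 bits
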
9 Common Steps for using decision trees (1)

 Variable selection.
 To select the most relevant input variables that should be used to
form decision tree models.
 Assessing the relative importance of variables.
 Generally, variable importance is computed from the reduction in model
accuracy when the variable is removed. In most circumstances, the more
records a variable has an effect on, the greater the importance of the variable.
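
A small sketch of this idea using scikit-learn's permutation importance (the iris data is a stand-in; shuffling a variable approximates removing it, and the resulting drop in accuracy is its importance):

    # Sketch: rank input variables by how much accuracy drops when each is permuted.
    from sklearn.datasets import load_iris
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    result = permutation_importance(clf, X_test, y_test, n_repeats=30, random_state=0)
    for name, imp in zip(load_iris().feature_names, result.importances_mean):
        print(f"{name}: {imp:.3f}")
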
10 Common Steps for using decision trees (2)

 Handling of missing values.

  A common but flawed approach is simply to exclude records with missing values, which introduces bias and inefficiency into the analysis.
  Decision tree analysis can deal with missing data in two ways:
  Either classify missing values as a separate category
  Or treat a variable with many missing values as a target variable, predict it, and replace the missing entries with the predicted values.
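
A brief sketch of the first option (treating missing values as their own category) with pandas; the column name and values are made up:

    # Sketch: encode missing values as their own category before building the tree.
    import pandas as pd

    df = pd.DataFrame({"income_band": ["low", None, "high", "mid", None]})

    # Keep the records and label the gaps as a separate "missing" category.
    df["income_band"] = df["income_band"].fillna("missing")
    print(df["income_band"].value_counts())
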
11 Common Steps for using decision trees (3)
 Each node specifies an attribute on which records are classified
  Root node, internal node, leaf node
 Each branch corresponds to one of the possible values of that attribute
 Splitting.
  Only input variables related to the target variable are used
  Identify the most important input variables, and then split records at the root node and at subsequent internal nodes
  Measures related to the degree of 'purity' (all records in a node share the target outcome) of the resultant child nodes include entropy (a measure of disorder), the Gini index, classification error, information gain, gain ratio, and the twoing criterion (see the sketch below)
  This splitting procedure continues until pre-determined homogeneity or stopping criteria are met.
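
A small sketch of three of these purity measures for a single node, given hypothetical class counts:

    # Sketch: impurity measures for one node, given its class counts (hypothetical).
    from math import log2

    counts = [40, 10]                     # e.g. 40 records of class A, 10 of class B
    p = [c / sum(counts) for c in counts]

    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)   # measure of disorder
    gini = 1 - sum(pi ** 2 for pi in p)                   # Gini index
    class_error = 1 - max(p)                              # classification error

    print(entropy, gini, class_error)
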
12 Common Steps for using decision trees (4)

 Stopping
 All decision trees need stopping criteria; otherwise it would be possible, and undesirable, to grow a tree in which each case occupies its own node. The resulting tree would be computationally expensive, difficult to interpret, and would probably not work very well with new data. Typical criteria:
  The number of cases in the node is less than some pre-specified limit.
  The purity of the node is more than some pre-specified limit.
  The depth of the node is more than some pre-specified limit.
  Predictor values for all records are identical, in which case no rule can be generated to split them.
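
In scikit-learn terms, these stopping rules correspond roughly to the tree's hyperparameters; a sketch with arbitrary values:

    # Sketch: common stopping criteria expressed (roughly) as scikit-learn hyperparameters.
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        min_samples_split=20,        # do not split a node holding fewer cases than this
        min_impurity_decrease=0.01,  # do not split unless purity improves by at least this much
        max_depth=5,                 # do not grow nodes deeper than this limit
        random_state=0,
    )
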
13 Common Steps for using decision trees (5)

 Pruning.
 In some situations, stopping rules do not work well. An alternative way to build a
decision tree model is to grow a large tree first, and then prune it to optimal size by
removing nodes that provide less additional information.
 Two types
  Pre-pruning (forward pruning) uses Chi-square tests or multiple-comparison adjustment methods to prevent the generation of non-significant branches.
  Post-pruning is applied after generating a full decision tree, removing branches in a way that improves the accuracy of the overall classification on the validation dataset.
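
As one concrete example, scikit-learn supports post-pruning via cost-complexity pruning; a sketch with an arbitrary ccp_alpha and the bundled breast-cancer data:

    # Sketch: post-pruning a fully grown tree via cost-complexity pruning.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

    print("full tree leaves:", full_tree.get_n_leaves(), "accuracy:", full_tree.score(X_valid, y_valid))
    print("pruned leaves:   ", pruned.get_n_leaves(), "accuracy:", pruned.score(X_valid, y_valid))
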
14 Common Steps for using decision trees (6)

 Pruning can be done with the following techniques

  Select the best possible sub-tree from several candidates by considering the proportion of records with erroneous predictions.
  One way to select the best alternative is to use a validation dataset, i.e., divide the sample in two: one part is used to develop the model (training dataset), while the other is used to test it (validation dataset).
  For small samples, use cross-validation, i.e., divide the sample into 10 groups or 'folds', develop the model from 9 folds, test it on the 10th fold, and average the rates of erroneous predictions.
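
A sketch of selecting among candidate sub-trees by cross-validated accuracy, using scikit-learn's cost-complexity pruning path to generate the candidates:

    # Sketch: choose the best candidate sub-tree by 10-fold cross-validation.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Each ccp_alpha on the pruning path corresponds to one candidate sub-tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    scores = [
        cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=10).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    print("best ccp_alpha:", best_alpha)
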
15 Common Steps for using decision trees (7)

 Prediction.
  Predicting the outcome for future records is one of the most important uses of decision tree models.
16 Disadvantages

 Decision trees can be subject to overfitting and underfitting, particularly when using a small data set.
 This can limit the generalizability and robustness of the resultant models.
 Strong correlation between different potential input variables may result in the selection of
variables that improve the model statistics but are not causally related to the outcome of
interest.
17 Application of Decision Trees in NLP

Part of speech (POS) tagging
Text Classification
18 Application of Decision Trees in NLP

 Three approaches are used for part-of-speech tagging

  Linguistic
  Uses a manually written set of rules/constraints
  Statistical
  Uses statistical models with lexical and transitional probabilities
  Problems in adapting the tagger to other languages, and a lack of annotated corpora
  Tagger accuracy is around 96-97%; high accuracy is needed for both known and unknown words
  Machine Learning
  Automatically learns a set of transformation rules
19 Use of Decision Trees for POS tagging

 Description of the training corpus and the word form lexicon

  Training corpus: a portion of 1,170,000 words of the WSJ (Wall Street Journal)
  Tag set: Penn Treebank (45 different tags), used to train and test
  Ambiguity: about 36.5% of the words in the corpus are ambiguous, with an ambiguity ratio of 2.44 tags/word over the ambiguous words and 1.52 overall
  Word form lexicon: a lexicon of 49,206 entries with the associated lexical probabilities for each word
20 POS Tagging

 The simplest heuristic is to choose for each word its most probable tag
according to the lexical probability.
 Choosing the proper syntactic tag for a word in a particular context
can be stated as a classification problem.
 A learning algorithm is applied over the set of possible tags,
 where the classes are identified with the tags.
21
It is possible to group all the words appearing in the corpus according to the set of
their possible tags; these groups are called ambiguity classes.
A taxonomy is extracted from the WSJ, and the general POS tagging problem is split into
one classification problem for each ambiguity class.
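
A toy sketch of forming ambiguity classes (the lexicon below is invented for illustration):

    # Toy sketch: group words by the set of tags they can take ("ambiguity classes").
    from collections import defaultdict

    lexicon = {                      # hypothetical word -> possible-tags lexicon
        "walk":  {"NN", "VB"},
        "run":   {"NN", "VB"},
        "table": {"NN"},
        "that":  {"DT", "IN", "WDT"},
    }

    ambiguity_classes = defaultdict(list)
    for word, tags in lexicon.items():
        ambiguity_classes[frozenset(tags)].append(word)

    for tags, words in ambiguity_classes.items():
        print(sorted(tags), "->", words)
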
22 Treetagger
 Classify the word using the corresponding decision tree. The ambiguity of
the context (either left or right) during classification may generate
multiple answers to the questions at the nodes. In this case, all the paths
are followed and the result is taken as a weighted average of the results of
all possible paths, where the weight of each path is its probability.
 Use the resulting probability distribution to update the probability
distribution of the word.
 Discard the tags with almost zero probability, that is, those with
probabilities lower than a certain discard-boundary parameter.
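
A minimal sketch of the last step (the probabilities and the discard boundary value are hypothetical):

    # Minimal sketch: drop near-zero tags from a word's distribution and renormalise.
    DISCARD_BOUNDARY = 0.01            # hypothetical value

    tag_probs = {"NN": 0.62, "VB": 0.376, "JJ": 0.004}

    kept = {t: p for t, p in tag_probs.items() if p >= DISCARD_BOUNDARY}
    total = sum(kept.values())
    tag_probs = {t: p / total for t, p in kept.items()}
    print(tag_probs)                   # JJ is discarded; NN and VB are rescaled
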
23 Treetagger
After the stopping criterion is satisfied, some words could still remain
ambiguous. There are then two possibilities:
1) Choose the most probable tag for each still-ambiguous word to
completely disambiguate the text.
2) Accept the residual ambiguity.

 Pruning the tree.
  To decrease the effect of over-fitting, a post-pruning technique is
implemented. Experimental tests have shown that, in this domain, the
pruning process reduces tree sizes by up to 50% and improves accuracy
by 2-5%.
25 Text Classification
 Early text classification was carried out by expert systems using if-then
rules
 In the 1990s, efforts were made to use machine learning algorithms
 A general inductive process (the learner) is fed with a set of "training" documents,
pre-classified according to the categories of interest.
 By observing the characteristics of the training documents, the learner may
generate a model (the classifier) of the conditions that are necessary for a
document to belong to any of the categories considered.
 This model can subsequently be applied to previously unseen documents for
classifying them according to these categories.
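
A compact sketch of this inductive process with scikit-learn (the training documents and categories are invented; a decision tree is used as the learner to match the topic of the talk):

    # Sketch: a learner fed with pre-classified training documents builds a classifier
    # that is then applied to unseen documents. Documents and labels are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    train_docs = ["stock prices rise", "team wins the match",
                  "markets fall sharply", "player scores a goal"]
    train_labels = ["finance", "sports", "finance", "sports"]

    model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
    model.fit(train_docs, train_labels)

    print(model.predict(["the market rallies", "a late goal wins it"]))
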
26 Text Classification
 Advantages over the knowledge engineering approach.
 A higher degree of automation is introduced: The engineer needs to
build not a text classifier, but an automatic builder of text classifiers
(the learner). Once built, the learner can then be applied to
generating many different classifiers for many different domains and
applications; one only needs to feed it with the appropriate sets of
training documents.
 The accuracy of the learned classifiers is higher than that of the knowledge engineering approach
27 A general process of text classification
28 Classifiers for text classification

 Bayesian classifier
 Decision Tree
 K-nearest neighbor (KNN)
 Support Vector Machines (SVMs)
 Neural Networks
 Rocchio's classifier
29 How does a decision tree work for text classification?
 When a decision tree is used for text classification, its internal nodes
are labelled by terms, the branches departing from them are labelled by
tests on the term weight, and the leaf nodes are labelled by the
corresponding class labels.
 The tree classifies a document by running it through this query
structure from the root until it reaches a leaf, which represents the
classification of the document (a toy sketch follows).
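
A toy sketch of this traversal (the tree structure, terms, and thresholds are invented for illustration):

    # Toy sketch: classify a document by walking a term-weight decision tree.
    def classify(doc_weights, node):
        # A leaf is just a class label (string); internal nodes test a term's weight.
        if isinstance(node, str):
            return node
        term, threshold, low_branch, high_branch = node
        branch = high_branch if doc_weights.get(term, 0.0) > threshold else low_branch
        return classify(doc_weights, branch)

    # Internal node format: (term, weight threshold, subtree if <=, subtree if >).
    tree = ("wheat", 0.5,
            ("corn", 0.3, "other", "grain"),
            "grain")

    print(classify({"wheat": 0.8}, tree))             # -> grain
    print(classify({"corn": 0.1, "oil": 0.9}, tree))  # -> other
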
30
 Advantages
  Simplicity in understanding and interpreting, even for non-expert
users.
  For multi-label documents, it reduces the cost of induction.
  Decision-tree-based symbolic rule induction systems for text
categorization also improve text classification.
 Disadvantage
  When most of the training data does not fit in memory, decision tree
construction becomes inefficient because training tuples must be swapped
in and out of memory. This issue has been addressed by methods that
handle both numeric and categorical data.
31 Which classifier to use?

 Different algorithms perform differently depending on the data
collection.
 However, to a certain extent, SVMs with a term-weighted VSM
(vector space model) representation scheme perform well in many text
classification tasks.
32 References

 Song Y, Lu Y. Decision tree methods: applications for classification and prediction.
Shanghai Arch Psychiatry. 2015;27:130-5.
 Màrquez, Lluís, and Horacio Rodríguez. "Part-of-speech tagging using decision
trees." European Conference on Machine Learning. Springer, Berlin, Heidelberg, 1998.
 Màrquez, Lluís, Lluis Padro, and Horacio Rodriguez. "A machine learning approach to
POS tagging." Machine Learning 39.1 (2000): 59-91.
 Sebastiani, Fabrizio. "Text categorization." Encyclopedia of Database Technologies and
Applications. IGI Global, 2005. 683-687.
 Korde, Vandana, and C. Namrata Mahender. "Text classification and classifiers: A
survey." International Journal of Artificial Intelligence & Applications 3.2 (2012): 85.
