
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm that only contains conditional control statements. A decision
tree is a flowchart-like structure in which each internal node represents a “test” on an attribute
(e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the
test, and each leaf node represents a class label (decision taken after computing all attributes).
The paths from root to leaf represent classification rules. Tree-based learning algorithms are
considered among the best and most widely used supervised learning methods. Tree-based
methods yield predictive models with high accuracy, stability, and ease of interpretation.
Unlike linear models, they map non-linear relationships quite well, and they adapt to a wide
range of classification and regression problems.
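The flowchart-like structure described above maps directly onto nested conditional statements: each "if" is an internal node testing an attribute, and each returned label is a leaf. The sketch below is a minimal, hand-written illustration; the attributes and labels (income, has_debt, risk levels) are invented for the example, not drawn from any real model.

```python
# A tiny hand-written decision tree. Each "if" is an internal node
# testing an attribute; each returned string is a leaf (class label).
# The path taken from the first test to a return is a classification rule.

def classify(record):
    if record["income"] >= 50_000:       # internal node: test on "income"
        if record["has_debt"]:           # internal node: test on "has_debt"
            return "medium risk"         # leaf: class label
        return "low risk"                # leaf: class label
    return "high risk"                   # leaf: class label

print(classify({"income": 60_000, "has_debt": False}))  # low risk
print(classify({"income": 30_000, "has_debt": True}))   # high risk
```

Reading the tree top-down recovers the rules, e.g. "income >= 50,000 and no debt implies low risk".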

DECISION TREE METHODOLOGY


Decision tree methodology is a commonly used data mining method for establishing
classification systems based on multiple covariates or for developing prediction algorithms
for a target variable. This method classifies a population into branch-like segments that
construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is
non-parametric and can efficiently deal with large, complicated datasets without imposing a
complicated parametric structure. When the sample size is large enough, study data can be
divided into training and validation datasets: the training dataset is used to build the decision
tree model, and the validation dataset is used to select the tree size that yields the optimal
final model. Common usages of decision tree models include the following:
• Variable selection. The number of variables that are routinely monitored in clinical settings
has increased dramatically with the introduction of electronic data storage. Many of these
variables are of marginal relevance and, thus, should probably not be included in data mining
exercises. Like stepwise variable selection in regression analysis, decision tree methods can
be used to select the most relevant input variables that should be used to form decision tree
models, which can subsequently be used to formulate clinical hypotheses and inform
subsequent research.
• Assessing the relative importance of variables. Once a set of relevant variables is
identified, researchers may want to know which variables play major roles. Generally,
variable importance is computed based on the reduction in model accuracy (or in the impurities
of the nodes in the tree) when the variable is removed. In most circumstances, the more records
a variable has an effect on, the greater the importance of the variable.
• Handling of missing values. A common - but incorrect - method of handling missing data
is to exclude cases with missing values; this is both inefficient and runs the risk of
introducing bias into the analysis. Decision tree analysis can deal with missing data in two
ways: it can either classify missing values as a separate category that can be analysed alongside
the other categories, or it can build a decision tree model that treats the variable with many
missing values as the target variable, predicts the missing entries, and replaces them with the
predicted values.
• Prediction. This is one of the most important usages of decision tree models. Using the tree
model derived from historical data, it’s easy to predict the result for future records.
• Data manipulation. Too many categories of one categorical variable or heavily skewed
continuous data are common in medical research. In these circumstances, decision tree
models can help in deciding how to best collapse categorical variables into a more
manageable number of categories or how to subdivide heavily skewed variables into ranges.
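The variable-importance idea above is often computed from impurity reduction at each split. The sketch below is a minimal illustration, assuming Gini impurity as the purity measure and a single binary split; the labels are invented example data.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    # 0.0 means the node is pure (one class only).
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    # Weighted reduction in impurity achieved by splitting "parent"
    # into "left" and "right". Summing these decreases over every split
    # a variable produces gives one common importance measure.
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]
print(impurity_decrease(parent, left, right))  # 0.5 (a perfect split into pure nodes)
```

A split that separates the classes completely, as here, claims the full impurity of the parent node; splits that leave mixed children claim less, so the variables behind them rank as less important.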

Techniques
1. Tracking patterns. One of the most basic techniques in decision tree analysis is learning to
recognize patterns in your data sets. This usually means spotting an aberration in your
data that occurs at regular intervals, or an ebb and flow of a certain variable over time. For
example, you might see that your sales of a certain product seem to spike just before the
holidays, or notice that warmer weather drives more people to your website.
2. Classification. Classification is a more complex decision tree technique that requires you
to collect various attributes together into discernible categories, which you can then use to
draw further conclusions, or serve some function. For example, if you’re evaluating data on
individual customers’ financial backgrounds and purchase histories, you might be able to
classify them as “low,” “medium,” or “high” credit risks. You could then use these
classifications to learn even more about those customers.
3. Association. Association is related to tracking patterns, but is more specific to dependently
linked variables. In this case, you’ll look for specific events or attributes that are highly
correlated with another event or attribute; for example, you might notice that when your
customers buy a specific item, they also often buy a second, related item. This is usually
what’s used to populate “people also bought” sections of online stores.
4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give
you a clear understanding of your data set. You also need to be able to identify anomalies, or
outliers, in your data. For example, if your purchasers are almost exclusively male, but during
one strange week in July, there’s a huge spike in female purchasers, you’ll want to investigate
the spike and see what drove it, so you can either replicate it or better understand your
audience in the process.
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of
data together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable income
they have, or how often they tend to shop at your store.
6. Regression. Regression, used primarily as a form of planning and modelling, is used to
estimate the value of a certain variable given the values of other variables. For
example, you could use it to project a certain price, based on other factors like availability,
consumer demand, and competition. More specifically, regression’s main focus is to help you
uncover the exact relationship between two (or more) variables in a given data set.
7. Prediction. Prediction is one of the most valuable Decision Tree techniques, since it’s
used to project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of what
will happen in the future. For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.
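As a rough illustration of the prediction technique, the sketch below learns a one-split "decision stump" (the simplest possible decision tree) from invented historical records and uses it to label new ones. Real tree learners apply the same greedy threshold search recursively to grow deeper trees.

```python
from collections import Counter

def majority(labels):
    # The most common class label in a list.
    return Counter(labels).most_common(1)[0][0]

def fit_stump(xs, ys):
    # Try every threshold on a single numeric feature and keep the one
    # whose two sides misclassify the fewest training records. This is
    # a simplified version of the greedy split search in tree learners.
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        errors = (sum(y != majority(left) for y in left)
                  + sum(y != majority(right) for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, majority(left), majority(right))
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

# Invented historical data: purchase amounts and an assigned risk label.
amounts = [100, 200, 900, 1200, 1500]
labels = ["ok", "ok", "risk", "risk", "risk"]
predict = fit_stump(amounts, labels)
print(predict(150), predict(1300))  # ok risk
```

Once fitted on historical records, the returned function can be applied to any future record, which is exactly the prediction usage the section describes.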
