
consistency

Mikhail Traskin

Department of Statistics, The Wharton School, University of Pennsylvania

Random Forests

Ensemble classification (and regression) algorithm

Proposed by Leo Breiman in 1999

Easy to implement

Very effective in applications, has good generalization properties

Algorithm outputs more information than just the class label

Breiman’s Experiments

Test-set error rates (%):

Data set        Adaboost   Random Forest
Votes               4.8         4.1
German credit      23.5        24.4
Letters             3.4         3.5
Sat-images          8.8         8.6
Zip-code            6.2         6.3   (0.025)
Waveform           17.8        17.2
Twonorm             4.9         3.9

Classification or Regression Problem

We are given

Sn = {(Xi, Yi) : i = 1, …, n} — set of i.i.d. observations distributed as P

Xi ∈ X — predictors

Yi ∈ Y — responses

Goal: find fn = A(Sn) s.t. E(ℓ(fn(X), Y)) is minimized.

Abstract Definition

Breiman (2001) defines a random forest as follows.

Definition 1 A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, …}, where the Θk are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.

The Random Forests Algorithm

1. Choose T — the number of trees to grow.

2. Choose m — the number of variables used to split each node, m ≪ M, where M is the number of input variables. m is held constant while growing the forest.

3. Grow T trees. When growing each tree do the following.

(a) Construct a bootstrap sample of size n by sampling from Sn with replacement, and grow a tree from this bootstrap sample.

(b) At each node of the tree, select m variables at random and use them to find the best split.

(c) Grow the tree to a maximal extent. There is no pruning.

4. To classify a point X, collect the votes from every tree in the forest and then use majority voting to decide on the class label.
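The steps above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available for the individual trees; all variable names and the toy dataset are illustrative, not part of the original talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n, M = X.shape

T = 25                  # 1. number of trees to grow
m = int(np.sqrt(M))     # 2. variables tried at each split, m << M

forest = []
for _ in range(T):
    idx = rng.integers(0, n, size=n)   # 3(a) bootstrap sample of size n
    tree = DecisionTreeClassifier(
        max_features=m,                # 3(b) m random variables per node
        # 3(c) no depth limit and no pruning: grow to a maximal extent
        random_state=int(rng.integers(1 << 30)),
    ).fit(X[idx], y[idx])
    forest.append(tree)

def rf_predict(forest, X):
    # 4. collect one vote per tree, then take the majority
    votes = np.stack([t.predict(X) for t in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)

print("training accuracy:", (rf_predict(forest, X) == y).mean())
```

With fully grown trees and an odd number of voters, the majority is always well defined for a binary problem.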

Compare to: Bagging

Breiman, 1996

Works with any classification algorithm

Like Random Forests, uses bootstrapping

Treats the underlying classification algorithm as a "black box"

A variance reduction technique
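For contrast, a hedged sketch of bagging with an arbitrary base learner treated as a black box; logistic regression is just a placeholder choice here, assuming scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def bag(base_factory, X, y, B=20, seed=0):
    """Fit B copies of a black-box classifier on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement
        models.append(base_factory().fit(X[idx], y[idx]))
    return models

def vote(models, X):
    # combine the black-box predictions by majority vote
    preds = np.stack([mdl.predict(X) for mdl in models])
    return (preds.mean(axis=0) > 0.5).astype(int)

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
models = bag(lambda: LogisticRegression(max_iter=1000), X, y)
print("training accuracy:", (vote(models, X) == y).mean())
```

Only the training data changes between replicates; the learner itself is untouched, which is what makes bagging a black-box technique.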

Compare to: Random Split Selection

Dietterich, 2000

Grow multiple trees

When splitting, choose the split uniformly at random from the K best splits

Can be used with or without pruning

Compare to: Random Subspace

Ho, 1998

Grow multiple trees

Each tree is grown using a fixed subset of variables

Do a majority vote or averaging to combine votes from different trees
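Ho's method can be sketched similarly; the key difference from Random Forests is that the variable subset is fixed per tree rather than re-drawn at every node. Scikit-learn trees and the subset sizes are assumptions of this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

ensemble = []
for _ in range(15):
    feats = rng.choice(X.shape[1], size=6, replace=False)  # fixed subspace
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
    ensemble.append((feats, tree))

def subspace_vote(ensemble, X):
    # majority vote across trees, each seeing only its own variables
    preds = np.stack([t.predict(X[:, f]) for f, t in ensemble])
    return (preds.mean(axis=0) > 0.5).astype(int)

print("training accuracy:", (subspace_vote(ensemble, X) == y).mean())
```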

RF and Error Estimation

1. For each pair (xi, yi) in the training sample

Select only the trees whose bootstrap sample does not contain the pair

Classify the pair with each of the selected trees

Compute the misclassification rate for the pair

2. Average over the computed estimates
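A sketch of this out-of-bag (OOB) error estimate, again assuming scikit-learn trees; scikit-learn's own `RandomForestClassifier` exposes the same idea via `oob_score=True`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n = len(X)

trees, inbag = [], []
for _ in range(50):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier(max_features="sqrt",
                                        random_state=0).fit(X[idx], y[idx]))
    inbag.append(np.bincount(idx, minlength=n))  # times each point was drawn

votes = np.zeros(n)
counts = np.zeros(n)
for tree, bag in zip(trees, inbag):
    oob = bag == 0                       # 1. trees that never saw the point
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

seen = counts > 0                        # left out by at least one tree
oob_pred = (votes[seen] / counts[seen] > 0.5).astype(int)
oob_error = (oob_pred != y[seen]).mean()  # 2. average over the estimates
print("OOB error estimate:", oob_error)
```

Each bootstrap sample leaves out roughly 37% of the points, so the OOB estimate comes for free, with no separate test set.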

RF and Variable Selection

1. For each tree in the forest

Classify the out-of-bag cases and count the number of correct votes

Permute the values of variable m in the out-of-bag sample

Classify the permuted out-of-bag sample and count the number of correct votes

Compute the difference between the unpermuted and permuted counts

2. Compute the average and standard deviation of the differences

3. Compute the z-statistic (average difference divided by its standard error)
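The permutation-importance steps above, sketched under the assumption of scikit-learn trees; with `shuffle=False`, `make_classification` keeps its informative columns first, so variable 0 is informative by construction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           shuffle=False, random_state=0)
n = len(X)
var = 0    # measure the importance of this (informative) variable

diffs = []
for _ in range(30):
    idx = rng.integers(0, n, size=n)                  # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)             # out-of-bag cases
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    correct = (tree.predict(X[oob]) == y[oob]).sum()  # unpermuted votes
    Xp = X[oob].copy()
    Xp[:, var] = rng.permutation(Xp[:, var])          # permute the variable
    correct_perm = (tree.predict(Xp) == y[oob]).sum()
    diffs.append(correct - correct_perm)              # per-tree difference

diffs = np.asarray(diffs, dtype=float)
z = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
print("mean difference:", diffs.mean(), " z-statistic:", round(z, 2))
```

Permuting an informative variable should cost correct votes, so a large positive z flags an important variable.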

RF and Interactions

Compute the Gini importance of each variable

Rank the Gini importance scores within each tree

For each pair of variables, compute the average rank difference over all trees

Unsupervised Learning

(Dis)similarity measure

For each tree, put the entire training sample down the tree

For each pair of observations, compute the fraction sij of trees in which they end up in the same terminal node

Compute the dissimilarity as dij = √(1 − sij)
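A sketch of this proximity computation; using scikit-learn's `tree.apply()` to read off terminal-node indices is an assumption of the illustration, not part of the original algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
n = len(X)

leaves = []
for _ in range(40):
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=0).fit(X[idx], y[idx])
    leaves.append(tree.apply(X))   # terminal node of every training point

L = np.stack(leaves)                               # shape (trees, n)
S = (L[:, :, None] == L[:, None, :]).mean(axis=0)  # s_ij: shared-leaf fraction
D = np.sqrt(1.0 - S)                               # d_ij = sqrt(1 - s_ij)
print(D.shape)
```

D is a symmetric matrix with zero diagonal, so it can feed any dissimilarity-based method directly.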

Unsupervised Learning

Synthetic datasets

Mark the observed data as "observed"

Generate a synthetic sample from the product of the marginals of the observed data

Mark the generated data as "unobserved"

Unsupervised Learning

Clustering

Train a random forest to separate the observed from the synthetic data

Use the forest to compute the dissimilarity measure for the observed data only

Use any clustering algorithm with the computed dissimilarity measure
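The three unsupervised-learning slides can be put together as one hedged end-to-end sketch; the scikit-learn forest, the SciPy hierarchical clustering, and the toy blob data are all assumptions of this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=150, centers=2, random_state=0)
n = len(X)

# synthetic sample: permute each column independently (product of marginals)
X_syn = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
Z = np.vstack([X, X_syn])
labels = np.r_[np.ones(n), np.zeros(n)]   # "observed" vs "unobserved"

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Z, labels)

leaves = rf.apply(X)                      # leaf ids, observed data only
S = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
D = np.sqrt(1.0 - S)                      # forest dissimilarity

# any clustering algorithm works; average-linkage hierarchical here
clusters = fcluster(linkage(squareform(D, checks=False), method="average"),
                    t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(clusters)[1:])
```

Permuting each column destroys the joint structure while preserving the marginals, so the forest is forced to learn exactly that joint structure, and its proximities inherit it.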

Universal Consistency

Assume i.i.d. data (X, Y), Sn = {(Xi, Yi) : i = 1, …, n} from X × Y, with Y = {−1, 1}.

Consider a method fn = A(Sn), for example fn = AdaBoost(Sn, tn).

Definition 2 A method is universally consistent if for any distribution P

L(fn) → L∗ a.s.,

where L(f) = P(f(X) ≠ Y) is the risk and L∗ = inf_f L(f) is the Bayes risk.

Is Random Forests Consistent?

Breiman (2001) wrote:

Section 2 gives some theoretical background for random forests. Use of the Strong Law of Large Numbers shows that they always converge so that overfitting is not a problem.

…

This result explains why random forests do not overfit as more trees are added, but produce a limiting value of the generalization error.

One-Dimensional Case

Theorem 3 Consider a binary classification problem. If X = R, then the classification Random Forests algorithm is equivalent to the 1-nearest-neighbor classifier and hence is not consistent.

Theorem 4 Consider a binary classification problem. If X = R and the bootstrap sample size k → ∞ s.t. k = o(n), then the classification Random Forests algorithm is consistent.

One-Dimensional Case

X = [0, 1], η(x) = P(Y = 1 | X = x) = 0.25 + 0.5·I{x ≥ 0.5},

L1NN = 0.375
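The value 0.375 follows from the Cover–Hart limit for the asymptotic 1-nearest-neighbor risk; a short check under the model above:

```latex
% Bayes risk: \eta(x) \in \{0.25, 0.75\}, so
% L^{*} = \mathbb{E}\bigl[\min(\eta(X),\, 1-\eta(X))\bigr] = 0.25.
% Cover--Hart asymptotic 1-NN risk:
L_{1NN} = \mathbb{E}\bigl[\,2\,\eta(X)\,(1-\eta(X))\,\bigr]
        = 2 \cdot 0.25 \cdot 0.75 = 0.375 \;>\; L^{*} = 0.25 .
```

This is exactly the point of Theorem 3: in one dimension the forest behaves like 1-NN, whose limiting risk 0.375 strictly exceeds the Bayes risk 0.25, so it cannot be consistent.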


Two-Dimensional Case (figures)

Four-Dimensional Case (figure)

Eight-Dimensional Case (figure)

Four-Dimensional Case

Decision boundary: hyperplane

Other versions of ensemble classifiers

Biau et al. (2007)

Consistency of purely random forests

Consistency of bagged nearest neighbor rules

Consistency of forests consisting of trees based on partitioning the space into nested rectangles
