Classification Tree Example
In this handout I show you the S function tree used to classify the iris data. The data are in various variables which I made from the built-in data set iris. Here is my S code:
postscript("iristree.ps", horizontal=F)
variety <- as.category(c(rep("Setosa", 50), rep("Versicolor", 50),
                         rep("Virginica", 50)))
iris.tree <- tree(variety ~ sepal.length + sepal.width + petal.length + petal.width)
plot(iris.tree)
text(iris.tree)
summary(iris.tree)
plot(petal.length, petal.width, xlab="Petal Length", ylab="Petal Width", type='n')
points(petal.length[1:50], petal.width[1:50], pch=1)
points(petal.length[51:100], petal.width[51:100], pch=2)
points(petal.length[101:150], petal.width[101:150], pch=3)
abline(v=2.45)
abline(h=1.75, lty=2)
abline(v=4.95, lty=3)
rt <- (petal.width < 1.75) & (variety > 1)
plot(petal.length[rt], sepal.length[rt], xlab="Petal Length", ylab="Sepal Length", type='n')
points(petal.length[rt & variety == 2], sepal.length[rt & variety == 2], pch=2)
points(petal.length[rt & variety == 3], sepal.length[rt & variety == 3], pch=3)
abline(v=4.95)
abline(h=5.15, lty=2)
dev.off()
I create the categorical variable variety to hold the labels for the three types of Iris and then essentially regress this categorical variable on the covariates shown. The statement summary(iris.tree) produces:
Classification tree:
tree(formula = variety ~ sepal.length + sepal.width + petal.length + petal.width)
Variables actually used in tree construction:
[1] "petal.length" "petal.width" "sepal.length"
Number of terminal nodes: 6
Residual mean deviance: 0.1253 = 18.05 / 144
Misclassification error rate: 0.02667 = 4 / 150

The jargon about deviance results from fitting a multinomial model to the probability that a given flower comes from a given species, given the covariates. The model treats the X's as fixed covariates and the vector of 150 varieties as the response. Though the model does not match the sampling scheme, it often produces good classifiers.
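
The numbers in the summary can be unpacked as follows. This is a sketch that assumes the usual definition of deviance for a classification tree under the multinomial model. The symbols n_{ik} (the number of flowers of species k in terminal node i), n_i (the total count in node i), and |T| (the number of terminal nodes) are notation introduced here for illustration; they do not appear in the S output.

```latex
% Fitted class probabilities within terminal node i (assumed notation)
\hat p_{ik} = \frac{n_{ik}}{n_i}

% Multinomial deviance summed over terminal nodes i and species k,
% with the convention 0 \log 0 = 0
D = -2 \sum_{i} \sum_{k} n_{ik} \log \hat p_{ik}

% Residual mean deviance: deviance divided by (observations - terminal nodes)
\frac{D}{n - |T|} = \frac{18.05}{150 - 6} = \frac{18.05}{144} \approx 0.1253

% Misclassification error rate: misclassified flowers over all flowers
\frac{4}{150} \approx 0.02667
```

Under this reading the denominator 144 is the 150 flowers minus the 6 terminal nodes, and a node containing only one species contributes nothing to D.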