## SVM classification example with performance measures using R caret

This example is a followup of hyperparameter tuning using the e1071 package in R. This time we’re using the SVM implementation from the R caret package, a binary class classification problem and some extended features that come in handy for many classification problems. For an easy start with caret take a look at one of the many presentations and intros to caret (like this one by Max Kuhn, maintainer of caret).

**Desirable features**

- Hyperparameter tuning using grid search
- Multicore processing for speedup
- Weighting of samples to compensate for unequal class sizes (not possible with all classifiers, but possible with SVM)
- Classifier outputting not only predicted classes but prediction probabilities. These can be used for extended performance measures (e.g. ROC curves)

**Dataset**

We use 2 out of the 3 classes from the standard R iris dataset (the versicolor and virginica classes). We further restrict ourselves to use only the first two features (Sepal.Length and Sepal.Width) to make well performing classification a bit more difficult. If we would use all features classification would be perfect, therefore performance measures would be rather boring.

> head(iris[order(iris$Species, decreasing = T),]) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 101 6.3 3.3 6.0 2.5 virginica 102 5.8 2.7 5.1 1.9 virginica 103 7.1 3.0 5.9 2.1 virginica 104 6.3 2.9 5.6 1.8 virginica 105 6.5 3.0 5.8 2.2 virginica 106 7.6 3.0 6.6 2.1 virginica

**Concepts**

We first split our data into a training and a test partition. Training data will be used for cross validation (CV) hyperparameter tuning and determining+building a model from the best hyperparameters. Test data will be used exclusively to do performance measures. Second, we create weights for our data samples so that each sample is weighted according to it’s corresponding class size (compensates for smaller and bigger class sizes). Samples of bigger classes will be assigned smaller weights and vice versa. As in our problem classes are exactly of equal size, weights will be equal for all samples too.

Next we search for best hyperparameters and create the according model using `caret train`

. For preprocessing we let `train`

auto scale and center our training data. There exist many other, well known and well working approaches for preprocessing, even implemented in caret (e.g. using PCA or IDA) – but we leave them out for the moment. For the parameter grid to be searched we use exponential parameter series. This way the level of detail gets adjusted along with the parameter size (be careful where to arrange for more and less details) and huge parameter spaces can be covered (with coarse detail) with the first run already. For cross validation we use 10 partitions (performance measures internally of caret will be computed as average partition performance). We allow parallel processing – and request train to create the classification model so that it computes prediction probabilities along with predicted classes.

After having found our model parameters and having created our model we let it predict our test data classes. We do that a) with predicting classes to obtain a confusion matrix and b) with prediction probabilities to create a ROC curve. For the ROC curve we have to define one of our classes as the positive class (we choose versicolor here) and the other (virginica) as the negative class. Finally, we extract the ROC curve points of equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER). They all together give you an idea of how well your model is performing on new data.

**Script code**

# Example of SVM hyperparameter tuning and performance measures for a binary class problem with the caret package # Rainhard Findling # 2014 / 07 # library(caret) library(plyr) library(ROCR) # optional - enable multicore processing library(doMC) registerDoMC(cores=4) data(iris) # create binary problem: only use two classes x <- iris[,1:2][iris$Species %in% c('versicolor', 'virginica'),] y <- factor(iris[,5][iris$Species %in% c('versicolor', 'virginica')]) # we see that data will probably not be perfectly seperable using linear separation featurePlot(x = x, y = y) # split into train and test data. train = cv parameter grid search and model creation, test = performance analysis set.seed(123456) indexes_y_test <- createDataPartition(y = 1:length(y), times = 1, p = 0.3)[[1]] # creation of weights - also fast for very big datasets weights <- as.numeric(y[-indexes_y_test]) for(val in unique(weights)) {weights[weights==val]=1/sum(weights==val)*length(weights)/2} # normalized to sum to length(samples) model <- train(method = 'svmLinear', x = x[-indexes_y_test,], y = y[-indexes_y_test], weights = weights, maximize = T, tuneGrid = expand.grid(.C=3^(-15:15)), preProcess = c('center', 'scale'), trControl = trainControl(method = 'cv', # cross validation number = 10, # nr of cv sets # repeats = 5, # use with method=repeatcv returnResamp = 'none', # return accuracy per cv partition and parameter setting classProbs = T, # return prediction probabilities along with predicted classes # savePredictions=T, # returns all predictions (for all cv paritions) for each tuneGrid parameter set returnData = F, # disable return of training data e.g. for big data sets allowParallel = T ) ) # we see some accuracy around 0.7-0.8 model head(with(model, results[order(results$Kappa, decreasing=T),])) # confusion matrix: model predicting classes of test data table(predict.train(object = model, newdata = x[indexes_y_test,], type='raw'), y[indexes_y_test]) # prediction probabilities of test data classes probs <- predict.train(object = model, newdata = x[indexes_y_test,], type='prob')[,1] isPositiveClass <- y[indexes_y_test] == 'versicolor' # for a ROC curve there is a positive class (true match rate...) - defining that class here pred <- prediction(probs, isPositiveClass) perf <- performance(pred, 'tpr', 'fpr') # plot: either plot(perf, lwd=2, col=3) # or with(attributes(perf), plot(x=x.values[[1]], y=y.values[[1]], type='l')) # some metrics: AUC (area under curve), EER (equal error rate), MSER (minimum squared error rate), Cohen's Kappa etc. AUC <- attributes(performance(pred, 'auc'))$y.value[[1]] # area under curve df <- with(attributes(pred), data.frame(cutoffs=cutoffs[[1]], tp=tp[[1]], fn=fn[[1]], tn=tn[[1]], fp=fp[[1]], TMR=tp[[1]]/(tp[[1]]+fn[[1]]), TNMR=tn[[1]]/(tn[[1]]+fp[[1]]))) df$MSER <- with(df, sqrt((1-TMR)**2+(1-TNMR)**2)) MSER <- with(df, sqrt(TMR**2+TNMR**2)) # sqrt of minimum squared error rate = eucl. distance to point TMR=TNMR=1 i_eer <- with(df, which.min(abs(TMR-TNMR))) EER <- with(df[i_eer,], mean(c(TMR,TNMR))) # equal error rate: mean would not be required when using ROCR as it's always exact df$acc <- with(df, (tp + tn) / length(isPositiveClass)) # observed accuracy df$acc_expected <- with(df, sum(isPositiveClass) * (tp+fp) / length(isPositiveClass) + sum(!isPositiveClass) * (tn+fn) / length(isPositiveClass)) / length(isPositiveClass) # expected accuracy df$kappa <- with(df, (acc - acc_expected) / (1 - acc_expected)) # cohen's kappa # graphical representation matplot(df[,6:11], lty=1:6, lwd=2, type='l'); legend('bottomright', legend=names(df[,6:11]), lty=1:6, lwd=2, col=1:6) # characteristics for settings with best EER, MSER etc. df[i_eer,] # curve point for equal error rate df[order(df$acc, decreasing=T)[[1]],] # curve point for max accuracy df[order(df$kappa, decreasing=T)[[1]],] # curve point for max kappa df[order(df$MSER, decreasing=F)[[1]],] # curve point for minimum squared error rate

**Results**

With a test run the confusion matrix for the test data set was

versicolor virginica versicolor 14 6 virginica 2 10

which points out that 2 out of 16 versicolor samples were incorrectly classified as virginica and 6 out of 16 virginica samples were incorrectly classified as versicolor. Based on prediction probabilities we generated a ROC curve. The true match rate (TMR) indicates the rate of how much test data set samples were correctly identified as being of the positive class (in our case: versicolor). The false match rate (FMR, counterpart to true non match rate TNMR) indicates the rate of how much test data set samples were incorrectly identified as being of the positive class.Although ROC curves indicate classification performance well it’s hard to compare classifiers based directly on ROC curves. Here, different other metrics come in handy, such as the area under curve (AUC), equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER = square of minimum euclidean distance between ROC curve and top left corner for TMR = TNMR = 1). In our case the AUC is about 0.8672, the other metrics (and their corresponding ROC curve points) are about:

cutoffs TMR TNMR accuracy accuracy_expected kappa MSER EER 0.550 0.750 0.75 0.75 0.5 0.5 0.354 Max acc. 0.515 0.875 0.75 0.8125 0.5 0.625 0.280 Max kappa 0.515 0.875 0.75 0.8125 0.5 0.625 0.280 Min MSER 0.515 0.875 0.75 0.8125 0.5 0.625 0.280