Savitzky-Golay Filters: Approximating Time Series using Polygons with an Example in R

Savitzky-Golay Filter Example

Continuous data streams (“time series data”) are usually smoothed before data processing is applied on them. For this purpose both the running mean filter (also called moving/rolling mean/average) and the related running median filter are frequently used. Both have the disadvantage of “cutting off” peaks. This is a side effect of not trying to approximate a given signal in the best way. That is, they apply a simple function without incorporating the error they introduce on the signal during the approximation.

As alternative to such approaches a Savitzky-Golay filter can be used. It tries to approximate a given signal using a sliding window approach and a low degree polynomial to model data within that window. In contrast to running mean/median it also incorporates the introduced error in the approximation process using linear least squares. This leads to not “simply cutting off peaks” but modeling them in the best way possible, just as the rest of the data.

Here’s a simple example of a Savizky-Golay filter in comparison to running mean/median in R on an excerpt of the beaver data:

matplot(data.frame( beaver1[,3], # original data
runmed(beaver1[,3], k = 11), # with running median filter
filter(filt = sgolay(p = 5, n = 11), x = beaver1[,3]) # with SG filter
), type='l', lwd=2, lty=1, ylab='')
legend('topleft', legend=c('original', 'runmed', 'Savitzky–Golay'), col=1:3, lty=1, lwd=2)


Savitzky-Golay Filter Example
Savitzky-Golay Filter Example

Install R from source: a simple way to compile R from source and use it in RStudio or on servers without X

R Logo

Sometimes we need to use special versions of R (e.g. the latest version of R is not available in the repositories of the machine we need to use). This post describes a simple way of compiling R from sources and using it in RStudio or on servers without X.

Get latest R source

Get the source of your desired R version from You want to obtain a file named R.x.y.z.tar from there and untar it using tar -xf R-x.y.z.tar.gz. Change into the untar-ed folder then.

More details on this are available at

Build R from source for RStudio

Built R using the following commands. Be aware that there are a number of prerequisites to this compilation, like having gcc and g++ installed.

./configure --enable-R-shlib
make check
make check-all


Passing the checks creates the R and Rscript frontends in ./bin/ — those are what you most likely need. You can call these directly, link to them, or use them in RStudio (see below). If you forgot to add the –enable-R-shlib parameter to configure you need to delete and re-extract the tar before redoing the process — otherwise make will fail.

More details on this are available at

Define the R to use with RStudio

RStudio uses the R defined by which R. Hence we can add the following to the ~/.profile file to define which R is used by RStudio:

# export RSTUDIO_WHICH_R=/path/to/your/R # this is what RStudio usually uses
export RSTUDIO_WHICH_R=/usr/local/bin/R  # this is what we use
. ~/.profile # reloads the profile without requiring us to logout+login again

More details on this are available at

Build R for environments without X/without Java

Do the same steps as above, but use this configure instead:

./configure --with-x=no --disable-java

More details on this are available at

Hint for “package not available for R version x.z.y…” errors

If package installations fail because your repositories don’t contain the required SW, try the Berkeley mirror: from our experience, they host lots of packages in many versions. For example:

install.packages("ggplot2", repos="")

Alternatively, the URL to the source archive can be specified directly:

packageurl <- ""
install.packages(packageurl, contriburl=NULL, type="source")

More details on this are available at

Yet another option is to use devtools to fetch code e.g. directly from GitHub:


More details on this example can be found at


SVM classification example with performance measures using R caret

The caret package (short for Classification And REgression Training)
The caret package (short for Classification And REgression Training)

This example is a followup of hyperparameter tuning using the e1071 package in R. This time we’re using the SVM implementation from the R caret package, a binary class classification problem and some extended features that come in handy for many classification problems. For an easy start with caret take a look at one of the many presentations and intros to caret (like this one by Max Kuhn, maintainer of caret).

Desirable features

  • Hyperparameter tuning using grid search
  • Multicore processing for speedup
  • Weighting of samples to compensate for unequal class sizes (not possible with all classifiers, but possible with SVM)
  • Classifier outputting not only predicted classes but prediction probabilities. These can be used for extended performance measures (e.g. ROC curves)


We use 2 out of the 3 classes from the standard R iris dataset (the versicolor and virginica classes). We further restrict ourselves to use only the first two features (Sepal.Length and  Sepal.Width) to make well performing classification a bit more difficult. If we would use all features classification would be perfect, therefore performance measures would be rather boring.

> head(iris[order(iris$Species, decreasing = T),])

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica


We first split our data into a training and a test partition. Training data will be used for cross validation (CV) hyperparameter tuning and determining+building a model from the best hyperparameters. Test data will be used exclusively to do performance measures. Second, we create weights for our data samples so that each sample is weighted according to it’s corresponding class size (compensates for smaller and bigger class sizes). Samples of bigger classes will be assigned smaller weights and vice versa. As in our problem classes are exactly of equal size, weights will be equal for all samples too.

Next we search for best hyperparameters and create the according model using caret train. For preprocessing we let train auto scale and center our training data. There exist many other, well known and well working approaches for preprocessing, even implemented in caret (e.g. using PCA or IDA) – but we leave them out for the moment. For the parameter grid to be searched we use exponential parameter series. This way the level of detail gets adjusted along with the parameter size (be careful where to arrange for more and less details) and huge parameter spaces can be covered (with coarse detail) with the first run already. For cross validation we use 10 partitions (performance measures internally of caret will be computed as average partition performance). We allow parallel processing – and request train to create the classification model so that it computes prediction probabilities along with predicted classes.

After having found our model parameters and having created our model we let it predict our test data classes. We do that a) with predicting classes to obtain a confusion matrix and b) with prediction probabilities to create a ROC curve. For the ROC curve we have to define one of our classes as the positive class (we choose versicolor here) and the other (virginica) as the negative class. Finally, we extract the ROC curve points of equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER). They all together give you an idea of how well your model is performing on new data.

Script code

# Example of SVM hyperparameter tuning and performance measures for a binary class problem with the caret package
# Rainhard Findling
# 2014 / 07

# optional - enable multicore processing

# create binary problem: only use two classes
x <- iris[,1:2][iris$Species %in% c('versicolor', 'virginica'),]
y <- factor(iris[,5][iris$Species %in% c('versicolor', 'virginica')])

# we see that data will probably not be perfectly seperable using linear separation
featurePlot(x = x, y = y)

# split into train and test data. train = cv parameter grid search and model creation, test = performance analysis
indexes_y_test <- createDataPartition(y = 1:length(y), times = 1, p = 0.3)[[1]]

# creation of weights - also fast for very big datasets
weights <- as.numeric(y[-indexes_y_test])
for(val in unique(weights)) {weights[weights==val]=1/sum(weights==val)*length(weights)/2} # normalized to sum to length(samples)

model <- train(method = 'svmLinear', 
           x = x[-indexes_y_test,], 
           y = y[-indexes_y_test], 
           weights = weights,
           maximize = T,
           tuneGrid = expand.grid(.C=3^(-15:15)),   
           preProcess = c('center', 'scale'),
           trControl = trainControl(method = 'cv', # cross validation
                                    number = 10,   # nr of cv sets
#                                     repeats = 5, # use with method=repeatcv
                                    returnResamp = 'none', # return accuracy per cv partition and parameter setting
                                    classProbs = T, # return prediction probabilities along with predicted classes
#                                     savePredictions=T, # returns all predictions (for all cv paritions) for each tuneGrid parameter set 
                                    returnData = F, # disable return of training data e.g. for big data sets
                                    allowParallel = T
# we see some accuracy around 0.7-0.8
head(with(model, results[order(results$Kappa, decreasing=T),]))
# confusion matrix: model predicting classes of test data
table(predict.train(object = model, newdata = x[indexes_y_test,], type='raw'), y[indexes_y_test])
# prediction probabilities of test data classes
probs <- predict.train(object = model, newdata = x[indexes_y_test,], type='prob')[,1]
isPositiveClass <- y[indexes_y_test] == 'versicolor' # for a ROC curve there is a positive class (true match rate...) - defining that class here
pred <- prediction(probs, isPositiveClass)
perf <- performance(pred, 'tpr', 'fpr')
# plot: either
plot(perf, lwd=2, col=3)
# or
with(attributes(perf), plot(x=x.values[[1]], y=y.values[[1]], type='l')) 

# some metrics: AUC (area under curve), EER (equal error rate), MSER (minimum squared error rate), Cohen's Kappa etc.
AUC <- attributes(performance(pred, 'auc'))$y.value[[1]] # area under curve
df <- with(attributes(pred), data.frame(cutoffs=cutoffs[[1]], tp=tp[[1]], fn=fn[[1]], tn=tn[[1]], fp=fp[[1]], TMR=tp[[1]]/(tp[[1]]+fn[[1]]), TNMR=tn[[1]]/(tn[[1]]+fp[[1]])))
df$MSER <- with(df, sqrt((1-TMR)**2+(1-TNMR)**2))
MSER <- with(df, sqrt(TMR**2+TNMR**2)) # sqrt of minimum squared error rate = eucl. distance to point TMR=TNMR=1
i_eer <- with(df, which.min(abs(TMR-TNMR)))
EER <- with(df[i_eer,], mean(c(TMR,TNMR))) # equal error rate: mean would not be required when using ROCR as it's always exact
df$acc <- with(df, (tp + tn) / length(isPositiveClass)) # observed accuracy
df$acc_expected <- with(df, sum(isPositiveClass) * (tp+fp) / length(isPositiveClass) + sum(!isPositiveClass) * (tn+fn) / length(isPositiveClass)) / length(isPositiveClass) # expected accuracy
df$kappa <- with(df, (acc - acc_expected) / (1 - acc_expected)) # cohen's kappa
# graphical representation 
matplot(df[,6:11], lty=1:6, lwd=2, type='l'); legend('bottomright', legend=names(df[,6:11]), lty=1:6, lwd=2, col=1:6)

# characteristics for settings with best EER, MSER etc.
df[i_eer,] # curve point for equal error rate
df[order(df$acc, decreasing=T)[[1]],] # curve point for max accuracy
df[order(df$kappa, decreasing=T)[[1]],] # curve point for max kappa
df[order(df$MSER, decreasing=F)[[1]],] # curve point for minimum squared error rate


With a test run the confusion matrix for the test data set was

             versicolor virginica
  versicolor         14         6
  virginica           2        10

which points out that 2 out of 16 versicolor samples were incorrectly classified as virginica and 6 out of 16 virginica samples were incorrectly classified as versicolor. Based on prediction probabilities we generated a ROC curve. The true match rate (TMR) indicates the rate of how much test data set samples were correctly identified as being of the positive class (in our case: versicolor). The false match rate (FMR, counterpart to true non match rate TNMR) indicates the rate of how much test data set samples were incorrectly identified as being of the positive class.rocAlthough ROC curves indicate classification performance well it’s hard to compare classifiers based directly on ROC curves. Here, different other metrics come in handy, such as the area under curve (AUC), equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER = square of minimum euclidean distance between ROC curve and top left corner for TMR = TNMR = 1). In our case the AUC is about 0.8672, the other metrics (and their corresponding ROC curve points) are about:

             cutoffs   TMR TNMR accuracy accuracy_expected kappa  MSER
EER            0.550 0.750 0.75     0.75               0.5   0.5 0.354
Max acc.       0.515 0.875 0.75   0.8125               0.5 0.625 0.280
Max kappa      0.515 0.875 0.75   0.8125               0.5 0.625 0.280
Min MSER       0.515 0.875 0.75   0.8125               0.5 0.625 0.280

mctune: multicore hyperparameter tuning in R on the example of SVM car detection


In Machine Learning (ML) tasks finding good hyperparameters for machine learning models is critical (hyperparameter optimization). In R there exist some packages containing routines doing that for you using grid search (constructing and testing all possible parameters as a grid, e.g. in David Meyer’s e1071 package).

Besides the very good routines already contained in those packages a while ago I liked to have multicore hyperparameter tuning and needed true positive rate (true match rate) and true negative rate (true non match rate) alongside overall error rate. Therefore I modified the tune method of the e1071 package to a) use multicore processing and b) return confusion matrizes showing average true positive / true negative rates per sample class and parameter setting.

The script inherits it’s GPL-2 licence from it’s original version. I modified tune specifically for my own needs when tuning SVM hyperparameters, therefore did not validate it for correctness or how it will work with other tune-able functions (e.g. KNN, random forest).

# modified version of "tune" from package "e1071"
# licence: GPL-2
# details:
mctune <- function(method, train.x, train.y = NULL, data = list(),
                 validation.x = NULL, validation.y = NULL,
                 ranges = NULL, predict.func = predict,
                 tunecontrol = tune.control(),
) {
  call <-


  ## internal helper functions
  resp <- function(formula, data) {
    model.response(model.frame(formula, data))

  classAgreement2 <- function (tab) {
    n <- sum(tab)
    # correct classification rate
    if (!is.null(dimnames(tab))) {
      lev <- intersect(colnames(tab), rownames(tab))
      d <- diag(tab[lev, lev])
      p0 <- sum(d) / n
    } else {
      m <- min(dim(tab))
      d <- diag(tab[1:m, 1:m])
      p0 <- sum(d) / n
    # confusion matrizes
    if(!confusionmatrizes) {
    } else if(is.null(dimnames(tab))) {
      stop('tables without dimension names are not allowed when generating confusionmatrizes.')
    else {
      # generate confusion matrix for each class
      classnames <- unique(unlist(dimnames(tab)))
      truepositives <-  unlist(lapply(classnames, function(positiveclassname) { sum(d[positiveclassname])}))
      falsepositives <- unlist(lapply(classnames, function(positiveclassname) { sum(tab[positiveclassname,])-tab[positiveclassname,positiveclassname]}))
      falsenegatives <- unlist(lapply(classnames, function(positiveclassname) { sum(tab[,positiveclassname])-tab[positiveclassname,positiveclassname]}))
      truenegatives <- mapply(FUN=function(tp,fn,fp){ sum(tab)-tp-fn-fp }, truepositives, falsenegatives, falsepositives)
      confusions <- data.frame(classnames, truepositives, truenegatives, falsepositives, falsenegatives, row.names=NULL)
      colnames(confusions) <- c('class', 'tp', 'tn', 'fp', 'fn')
      list(p0=p0, confusions=confusions)

  ## parameter handling
  if (tunecontrol$sampling == "cross")
    validation.x <- validation.y <- NULL
  useFormula <- is.null(train.y)
  if (useFormula && (is.null(data) || length(data) == 0))
    data <- model.frame(train.x)
  if (is.vector(train.x)) train.x <- t(t(train.x))
  if (
    train.y <- as.matrix(train.y)

  ## prepare training indices
  if (!is.null(validation.x)) tunecontrol$fix <- 1
  n <- (nrow(if (useFormula) data
            else train.x))
  perm.ind <- sample(n)
  if (tunecontrol$sampling == "cross") {
    if (tunecontrol$cross > n)
      stop(sQuote("cross"), " must not exceed sampling size!")
    if (tunecontrol$cross == 1)
      stop(sQuote("cross"), " must be greater than 1!")
  train.ind <- (if (tunecontrol$sampling == "cross")
                  tapply(1:n, cut(1:n, breaks = tunecontrol$cross), function(x) perm.ind[-x])
                else if (tunecontrol$sampling == "fix")
                  list(perm.ind[1:trunc(n * tunecontrol$fix)])
                ## bootstrap
                         function(x) sample(n, n * tunecontrol$boot.size, replace = TRUE))

  ## find best model
  parameters <- (if(is.null(ranges))
                  data.frame(dummyparameter = 0)
  p <- nrow(parameters)
  if (!is.logical(tunecontrol$random)) {
    if (tunecontrol$random < 1)
      stop("random must be a strictly positive integer")
    if (tunecontrol$random > p) tunecontrol$random <- p
    parameters <- parameters[sample(1:p, tunecontrol$random),]

  ## - loop over all models
  # concatenate arbitrary mc-arguments with explicit X and FUN arguments
  train_results<"mclapply", args=c(mc.control, list(X=1:p, ..., FUN=function(para.set) {
    sampling.errors <- c()
    sampling.confusions <- c()

    ## - loop over all training samples
    for (sample in 1:length(train.ind)) {
      repeat.errors <- c()
      repeat.confusions <- c()

      ## - repeat training `nrepeat' times
      for (reps in 1:tunecontrol$nrepeat) {

        ## train one model
        pars <- if (is.null(ranges))
          lapply(parameters[para.set,,drop = FALSE], unlist)

        model <- if (useFormula)
, c(list(train.x,
                                 data = data,
                                 subset = train.ind[[sample]]),
                            pars, list(...)
, c(list(train.x[train.ind[[sample]],],
                                 y = train.y[train.ind[[sample]]]),
                            pars, list(...)

        ## predict validation set
        pred <- predict.func(model,
                             if (!is.null(validation.x))
                             else if (useFormula)
                               data[-train.ind[[sample]],,drop = FALSE]
                             else if (inherits(train.x, "matrix.csr"))
                               train.x[-train.ind[[sample]],,drop = FALSE]

        ## compute performance measure
        true.y <- if (!is.null(validation.y))
        else if (useFormula) {
          if (!is.null(validation.x))
            resp(train.x, validation.x)
            resp(train.x, data[-train.ind[[sample]],])
        } else

        if (is.null(true.y)) true.y <- rep(TRUE, length(pred))

        if (!is.null(tunecontrol$
          repeat.errors[reps] <- tunecontrol$, pred)
        else if ((is.logical(true.y) || is.factor(true.y)) && (is.logical(pred) || is.factor(pred) || is.character(pred))) { ## classification error
          l <- classAgreement2(table(pred, true.y))
          repeat.errors[reps] <- (1 - l$p0) # wrong classification rate
          if(confusionmatrizes) {
            repeat.confusions[[reps]] <- l$confusions
        } else if (is.numeric(true.y) && is.numeric(pred)) ## mean squared error
          repeat.errors[reps] <- crossprod(pred - true.y) / length(pred)
          stop("Dependent variable has wrong type!")
      sampling.errors[sample] <- tunecontrol$repeat.aggregate(repeat.errors)
      # TODO potentially implement separate aggregation of tp tn fp fn values. currently those are taken with correlate to the least error.
      if(confusionmatrizes) {
        sampling.confusions[[sample]] <- repeat.confusions[repeat.errors == sampling.errors[sample]][[1]]
    # TODO potentially implement separate aggregation of tp tn fp fn values. currently uses the same as for error / variance aggregation
    if(!confusionmatrizes) {
    } else {
      # create one confusion data frame
      confusions <- ldply(sampling.confusions, data.frame)
      # calculate aggregate / disperse values per class
      confusions <- ldply(lapply(X=split(confusions, confusions$class), FUN=function(classdf) {
        # only take numeric values
        # calculate aggregate / disperse values for this class
        aggregated <- apply(X=classdf[,c('tp','tn','fp','fn')], MAR=2, FUN=tunecontrol$sampling.aggregate)
        dispersions <- apply(X=classdf[,c('tp','tn','fp','fn')], MAR=2, FUN=tunecontrol$sampling.dispersion)
        # make 1 row dataframe out of it (combine rows later with outer ldply)
      }), data.frame)
      colnames(confusions) <- c('class', 'tp.value', 'tn.value', 'fp.value', 'fn.value', 'tp.dispersion', 'tn.dispersion', 'fp.dispersion', 'fn.dispersion')
      # calculate mean confusion matrix values (mean of all classes) for best model
      confusions.mean <- data.frame(t(apply(X=confusions[,c('tp.value','tn.value','fp.value','fn.value','tp.dispersion','tn.dispersion','fp.dispersion','fn.dispersion')], MAR=2, FUN=mean)))
      colnames(confusions.mean) <- c('tp.value', 'tn.value', 'fp.value', 'fn.value', 'tp.dispersion', 'tn.dispersion', 'fp.dispersion', 'fn.dispersion')
#   print('mctune: mclapply done.')
#   print(train_results)
  model.errors <- unlist(lapply(train_results,function(x)x$model.error))
  model.variances <- unlist(lapply(train_results,function(x)x$model.variance))
    model.confusions <- lapply(train_results,function(x)x$model.confusions)
    model.confusions.mean <- ldply(lapply(train_results,function(x)x$model.confusions.mean), data.frame)

  ## return results
  best <- which.min(model.errors)
  pars <- if (is.null(ranges))
    lapply(parameters[best,,drop = FALSE], unlist)
  structure(list(best.parameters  = parameters[best,,drop = FALSE],
                 best.performance = model.errors[best],
                 method           = if (!is.character(method))
                   deparse(substitute(method)) else method,
                 nparcomb         = nrow(parameters),
                 train.ind        = train.ind,
                 sampling         = switch(tunecontrol$sampling,
                                           fix = "fixed training/validation set",
                                           bootstrap = "bootstrapping",
                                           cross = if (tunecontrol$cross == n) "leave-one-out" else
                                             paste(tunecontrol$cross,"-fold cross validation", sep="")
                 performances     = if (tunecontrol$performances) cbind(parameters, error = model.errors, dispersion = model.variances),
                 confusionmatrizes = if (confusionmatrizes) model.confusions,
                 confusionmatrizes.mean = if(confusionmatrizes) model.confusions.mean,
                 best.confusionmatrizes = if(confusionmatrizes) model.confusions[[best]],
                 best.confusionmatrizes.mean = if(confusionmatrizes) model.confusions.mean[best,],
                 best.model       = if (tunecontrol$best.model) {
                   modeltmp <- if (useFormula)
           , c(list(train.x, data = data),
                                       pars, list(...)))
           , c(list(x = train.x,
                                            y = train.y),
                                       pars, list(...)))
                   call[[1]] <- as.symbol("best.tune")
                   modeltmp$call <- call
            class = "tune"

Additional parameters


A list of parameters that go to mclapply from the parallel package. Example: mc.control=list(mc.cores=3, mc.preschedule=F)


Binary flag indicating if confusion matrizes should be generated or not.

Additional return values


List of confusion matrizes. The matrizes are sorted the same way as tune$performances is. Each matrix is a data frame listing the sample classes (class), it’s corresponding absolute average values (value) and standard deviation (dispersion) for true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). An pre-drawn example of such a confusionmatrix from our car detection below could look like this (where “yes” means “car” and “no” means “no car”):

class tp.value tn.value fp.value fn.value tp.dispersion tn.dispersion fp.dispersion fn.dispersion
   no     49.5     53.4      1.6      0.5      5.212165      5.621388     0.8432740     0.5270463
  yes     53.4     49.5      0.5      1.6      5.621388      5.212165     0.5270463     0.8432740

In above example for the “yes” class the tp.value=53.4 means that on average 53.4 samples were correctly classified as “yes” using the corresponding parameter setting.

Car detection

As ML example we do a car detection. The task is to decide if a given image contains or not contains a car (therefore leaving out searching mechanisms like sliding window). As data source we use the UIUC Image Database for Car Detection. In order to reduce the amount of features (pixels) and “blur” images (which discards unnecessary details and likely even increases recognition probabilities) I resized the images from original 100×40 to 50×20 pixels. For a human it’s still easy to decide if an image shows a car or not at this scale:

No car 1
No car 1
No car 2
No car 2
No car 3
No car 3
Car 1
Car 1
Car 2
Car 2
Car 3
Car 3



As ML model we use a support vector machine with radial (Gaussian) kernel, which leaves us with the cost and gamma parameter to be tuned – which we do using a grid search and 10 fold cross validation.

# load data
print('loading data...')
car_neg <- lapply(dir(pattern="neg.*png"),readPNG)
car_pos <- lapply(dir(pattern="pos.*png"),readPNG)
data <- c(car_neg,car_pos)
labels <- factor(c(rep('no',length(car_neg)), rep('yes', length(car_pos))))
# put data to dataframe
data <- t(array(unlist(data),dim=c(length(data[[1]]),length(data))))
# look at a car
# image(array(as.numeric(data[600,]),dim=c(20,50)))
# look at correlation between first pixels
# plot(data.frame(data,labels)[,1:8],pch='.') 

          mc.control=list(mc.cores=3, mc.preschedule=F),
          validation.x=NULL, #validation.x and .y are only used with tune.control sampling='boot' and 'fixed'.
# extract FMR FNMR from our positive class
p <- lapply(X=t$confusionmatrizes, FUN=function(x){
  p <- x[x$class=='yes',]
p <- ldply(p)
t$performances <- data.frame(t$performances, p)
t$performances$FMR <- t$performances$fp.value / (t$performances$tn.value + t$performances$fp.value)
t$performances$FNMR <- t$performances$fn.value / (t$performances$tp.value + t$performances$fn.value)
# print list of errors
t$performances[with(t$performances, order(error)),]

# different plots of parameters and errors
scatterplot3d(t$performances$cost, t$performances$gamma, t$performances$error,log='xy', type='h', pch=1, color='red')
# paramters and errors: from best to worst
plot(t$performances[with(t$performances, order(error)),]$cost,log='y', col='blue')
points(t$performances[with(t$performances, order(error)),]$gamma, col='green')
points(t$performances[with(t$performances, order(error)),]$error, col='black')
# points(t$performances[with(t$performances, order(error)),]$FMR, col='red')
# points(t$performances[with(t$performances, order(error)),]$FNMR, col='orange')

Using this setup with the first search we obtain an error rate of about 0.02 for gamma 10^-3 and cost >= 3 (this indicates the search area for the next step). The corresponding false match rate (FMR) and false non match rate (FNMR) are in the range of 0.01 and 0.029.

Image classification using SVMs in R

Recently I did some Support Vector Machine (SVM) tests in R (statistical language with functional parts for rapid prototyping and data analysis — somehow similar to Matlab, but open source ;)) for my current face recognition projects. To get my SVMs up and running in R, using image data as in- and output, I wrote a small demo script for classifying images. As test data I used 2 classes of images (lines from left top to right bottom and lines from left bottom to right top), with 10 samples each — like these:

The complete image set is available here.

For SVM classification simple train and test sets get used — for more sophisticated problems n-fold cross validation for searching good parameter settings is recommended instead. For everybody who did not yet work with SVMs, I’d recommend reading something about how to start with “good” SVM classification, like the pretty short and easy to read “A Practical Guide to Support Vector Classification” from Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (the LIBSVM-inventors).

Update: added parallel processing using parallel and mclapply for loading image data (for purpose of demonstration only, loading 10 images in parallel does not make a big difference ;)).

print('starting svm demo...')


# load img data
file_list <- dir(folder, pattern="png")
data <- mclapply(file_list, readPNG, mc.cores=2)
# extract subject id + img nr from names
subject_ids <- lapply(file_list, function(file_name) as.numeric(unlist(strsplit(file_name, "_"))[1]))
# rename subject id's to c1 and c2 for more clear displaying of results
img_ids <- lapply(file_list, function(file_name) as.numeric(unlist(strsplit(unlist(strsplit(file_name, "_"))[2], "\\."))[1]))

# specify which data should be used as test and train by the img nrs
train_test_border <- 7
# split data into train and test, and bring into array form to feed to svm
train_in <- t(array(unlist(data[img_ids < train_test_border]), dim=c(length(unlist(data[1])),sum(img_ids < train_test_border))))
train_out <- unlist(subject_ids[img_ids < train_test_border])
test_in <- t(array(unlist(data[img_ids >= train_test_border]), dim=c(length(unlist(data[1])),sum(img_ids >= train_test_border))))
test_out <- unlist(subject_ids[img_ids >= train_test_border])

# train svm - try out different kernels + settings here
svm_model <- svm(train_in, train_out, type='C', kernel='linear')

# evaluate svm
p <- predict(svm_model, train_in)
print(table(p, train_out))
p <- predict(svm_model, test_in)
print(table(p, test_out))

print('svm demo done!')