Archive for the ‘Data Analysis’ Category

Install R from source: a simple way to compile R from source and use it in RStudio or on servers without X

Sometimes we need a specific version of R, e.g. because the latest version of R is not available in the repositories of the machine we have to use. This post describes a simple way of compiling R from source and using it in RStudio or on servers without X.

Get latest R source

Get the source of your desired R version from https://www.r-project.org/. You want to obtain a file named R-x.y.z.tar.gz from there and untar it using tar -xf R-x.y.z.tar.gz. Then change into the extracted folder.

More details on this are available at https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Getting-and-unpacking-the-sources

Build R from source for RStudio

Build R using the following commands. Be aware that there are a number of prerequisites for this compilation, such as having gcc, g++, and gfortran installed.

./configure --enable-R-shlib
make
make check
make check-all

 

Passing the checks creates the R and Rscript frontends in ./bin/ — those are what you most likely need. You can call these directly, link to them, use them in RStudio (see below), or install them system-wide (by default to /usr/local) with make install. If you forgot to add the --enable-R-shlib parameter to configure, you need to delete the extracted folder and re-extract the tar before redoing the process — otherwise make will fail.

More details on this are available at https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Simple-compilation

Define the R to use with RStudio

By default RStudio uses the R found via which R; this can be overridden with the RSTUDIO_WHICH_R environment variable. Hence we can add the following to the ~/.profile file to define which R is used by RStudio:

# export RSTUDIO_WHICH_R=/path/to/your/R # this is what RStudio usually uses
export RSTUDIO_WHICH_R=/usr/local/bin/R  # this is what we use
. ~/.profile # reloads the profile without requiring us to logout+login again

More details on this are available at https://support.rstudio.com/hc/en-us/articles/200486138-Using-Different-Versions-of-R

Build R for environments without X/without Java

Do the same steps as above, but use this configure instead:

./configure --with-x=no --disable-java

More details on this are available at https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Essential-programs-and-libraries
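
To verify such a build, you can check from within the freshly compiled R which capabilities were compiled in. This is just a quick sanity check; the exact output depends on your system and the configure flags used:

# run inside the freshly built R (started e.g. via ./bin/R)
capabilities(c("X11", "jpeg", "png", "tcltk")) # X11 should report FALSE for a --with-x=no build
R.version.string                               # the version you just built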

Hint for “package not available for R version x.z.y…” errors

If package installations fail because your default mirror doesn’t offer the required software, try the Berkeley mirror: from our experience, they host lots of packages in many versions. For example:

install.packages("ggplot2", repos="http://cran.cnr.berkeley.edu")

Alternatively, the URL to the source archive can be specified directly:

packageurl <- "http://cran.r-project.org/src/contrib/Archive/XXXX/XXXX_A.B.C.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

More details on this are available at http://stackoverflow.com/questions/17998543/what-to-do-when-a-package-is-not-available-for-our-r-version

Yet another option is to use devtools to fetch code e.g. directly from GitHub:

devtools::install_github("hadley/lineprof")

More details on this example can be found at http://adv-r.had.co.nz/Profiling.html

 

Categories: Data Analysis

Weka on Android: load precomputed model and predict new samples

November 27, 2016

Imagine you want to use machine learning on Android to predict the value of your target variable for new samples. Usually, neither the storage capacity (for storing the training data) nor the computational power for training such models is available on mobile devices – assuming that you use a medium or large dataset and a thorough evaluation of different model types and model parametrizations with a subsequent model selection. In such a scenario, the way to go is:

  1. Train, evaluate, and select the optimal model for your machine learning prediction offline, e.g. using a desktop PC or server hardware.
  2. Export the trained model (the “model file”, which usually is relatively small) and ship it with your Android application.
  3. The “application case”/”production case” on the mobile device then simply involves loading the model and using it to predict new, yet unseen samples. This way, the prediction on the mobile device requires much less computational power and storage than the previous training of the model.

Based on the previous post about loading and using Weka models with Java SE, this example demonstrates how to load and use a model on Android which was previously trained with Weka “offline”, i.e. on a desktop PC. The core points you have to look out for when setting up the Android project:

  1. Use the exact same version of Weka you used for offline training the model also on Android.
  2. Copy the weka.jar to libs/, then select “Add as library…”.
  3. Ensure that "compile files('libs/weka.jar')" is present in build.gradle.
  4. Do a clean build: at the first time, this is going to require some CPU power and take some time.
  5. Don’t do anything with Weka that is UI related: Weka uses the JavaSE UI, which is not available on Android.
  6. Follow the details in the example below:
    1. Structure the new, yet unseen instance you want to predict the target variable so that Weka can work with it.
    2. Load the classifier object.
    3. Predict the target variable for the new instance.

The complete Android example project containing the code below can be downloaded here:

  • Download (7z): Android Weka Example.
  • SHA256sum: aaf8594954d3561218dc0dd621740f3184a171efa7016a2efe4b743a19845ffb
package at.fhooe.mcm.ml.androidwekaexample;

import android.content.res.AssetManager;
import android.os.Bundle;
import android.support.v7.app.AppCompatActivity;
import android.support.v7.widget.Toolbar;
import android.util.Log;
import android.view.View;
import android.view.Menu;
import android.view.MenuItem;
import android.widget.TextView;
import android.widget.Toast;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import ml.mcm.fhooe.at.androidwekaexample.R;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class MainActivity extends AppCompatActivity {

    private static final String WEKA_TEST = "WekaTest";

    private Random mRandom = new Random();

    private Sample[] mSamples = new Sample[]{
            new Sample(1, 0, new double[]{5, 3.5, 2, 0.4}), // should be in the setosa domain
            new Sample(2, 1, new double[]{5.6, 3, 3.5, 1.2}), // should be in the versicolor domain
            new Sample(3, 2, new double[]{7, 3, 6.8, 2.1}) // should be in the virginica domain
    };

    private Classifier mClassifier = null;

    TextView mTextViewSamples = null;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        Toolbar toolbar = (Toolbar) findViewById(R.id.toolbar);
        setSupportActionBar(toolbar);

        // show samples
        StringBuilder sb = new StringBuilder("Samples:\n");
        for(Sample s : mSamples) {
            sb.append(s.toString() + "\n");
        }
        mTextViewSamples = (TextView) findViewById(R.id.textViewSamples);
        mTextViewSamples.setText(sb.toString());

        Log.d(WEKA_TEST, "onCreate() finished.");
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(R.menu.menu_main, menu);
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        // Handle action bar item clicks here. The action bar will
        // automatically handle clicks on the Home/Up button, so long
        // as you specify a parent activity in AndroidManifest.xml.
        int id = item.getItemId();

        //noinspection SimplifiableIfStatement
        if (id == R.id.action_settings) {
            return true;
        }

        return super.onOptionsItemSelected(item);
    }

    public void onClickButtonLoadModel(View _v) {
        Log.d(WEKA_TEST, "onClickButtonLoadModel()");

        AssetManager assetManager = getAssets();
        try {
            mClassifier = (Classifier) weka.core.SerializationHelper.read(assetManager.open("iris_model_logistic_allfeatures.model"));

        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            // Weka "catch'em all!"
            e.printStackTrace();
        }
        Toast.makeText(this, "Model loaded.", Toast.LENGTH_SHORT).show();
    }

    public void onClickButtonPredict(View _v) {
        Log.d(WEKA_TEST, "onClickButtonPredict()");

        if(mClassifier==null){
            Toast.makeText(this, "Model not loaded!", Toast.LENGTH_SHORT).show();
            return;
        }

        // we need those for creating new instances later
        // order of attributes/classes needs to be exactly equal to those used for training
        final Attribute attributeSepalLength = new Attribute("sepallength");
        final Attribute attributeSepalWidth = new Attribute("sepalwidth");
        final Attribute attributePetalLength = new Attribute("petallength");
        final Attribute attributePetalWidth = new Attribute("petalwidth");
        final List<String> classes = new ArrayList<String>() {
            {
                add("Iris-setosa"); // cls nr 1
                add("Iris-versicolor"); // cls nr 2
                add("Iris-virginica"); // cls nr 3
            }
        };

        // Instances(...) requires ArrayList<> instead of List<>...
        ArrayList<Attribute> attributeList = new ArrayList<Attribute>(2) {
            {
                add(attributeSepalLength);
                add(attributeSepalWidth);
                add(attributePetalLength);
                add(attributePetalWidth);
                Attribute attributeClass = new Attribute("@@class@@", classes);
                add(attributeClass);
            }
        };
        // unpredicted data sets (reference to sample structure for new instances)
        Instances dataUnpredicted = new Instances("TestInstances",
                attributeList, 1);
        // last feature is target variable
        dataUnpredicted.setClassIndex(dataUnpredicted.numAttributes() - 1);

        // create a new instance from a randomly chosen sample
        final Sample s = mSamples[mRandom.nextInt(mSamples.length)];
        DenseInstance newInstance = new DenseInstance(dataUnpredicted.numAttributes()) {
            {
                setValue(attributeSepalLength, s.features[0]);
                setValue(attributeSepalWidth, s.features[1]);
                setValue(attributePetalLength, s.features[2]);
                setValue(attributePetalWidth, s.features[3]);
            }
        };
        // reference to dataset
        newInstance.setDataset(dataUnpredicted);

        // predict new sample
        try {
            double result = mClassifier.classifyInstance(newInstance);
            String className = classes.get(new Double(result).intValue());
            String msg = "Nr: " + s.nr + ", predicted: " + className + ", actual: " + classes.get(s.label);
            Log.d(WEKA_TEST, msg);
            Toast.makeText(this, msg, Toast.LENGTH_SHORT).show();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public class Sample {
        public int nr;
        public int label;
        public double [] features;

        public Sample(int _nr, int _label, double[] _features) {
            this.nr = _nr;
            this.label = _label;
            this.features = _features;
        }

        @Override
        public String toString() {
            return "Nr " +
                    nr +
                    ", cls " + label +
                    ", feat: " + Arrays.toString(features);
        }
    }
}
Categories: Data Analysis

Weka in Java: Predict new samples using a precomputed and exported model object

October 2, 2016

Weka allows for exporting/saving computed models into a binary model file (usually having a “.model” file extension). On other machines (that e.g. don’t have access to the training data or lack the computational power for training the model themselves), those model files can be loaded and used to predict new samples. The following Java example highlights two aspects of this process: a) loading such a model object, and b) formatting a new, yet unclassified instance so that its target variable can be predicted using the loaded model:

import java.util.ArrayList;
import java.util.List;

import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class Main {

	public static void main(String[] args) {
		new Main().main();
	}

	public void main() {

		// we need those for creating new instances later
		final Attribute attributeSepalLength = new Attribute("sepallength");
		final Attribute attributeSepalWidth = new Attribute("sepalwidth");
		final Attribute attributePetalLength = new Attribute("petallength");
		final Attribute attributePetalWidth = new Attribute("petalwidth");
		final List<String> classes = new ArrayList<String>() {
			{
				add("Iris-setosa");
				add("Iris-versicolor");
				add("Iris-virginica");
			}
		};

		// Instances(...) requires ArrayList<> instead of List<>...
		ArrayList<Attribute> attributeList = new ArrayList<Attribute>(2) {
			{
				add(attributeSepalLength);
				add(attributeSepalWidth);
				add(attributePetalLength);
				add(attributePetalWidth);
				Attribute attributeClass = new Attribute("@@class@@", classes);
				add(attributeClass);
			}
		};
		// unpredicted data sets (reference to sample structure for new instances)
		Instances dataUnpredicted = new Instances("TestInstances",
				attributeList, 1);
		// last feature is target variable
		dataUnpredicted.setClassIndex(dataUnpredicted.numAttributes() - 1); 

		// create new instance: this one should fall into the setosa domain
		DenseInstance newInstanceSetosa = new DenseInstance(dataUnpredicted.numAttributes()) {
			{
				setValue(attributeSepalLength, 5);
				setValue(attributeSepalWidth, 3.5);
				setValue(attributePetalLength, 2);
				setValue(attributePetalWidth, 0.4);
			}
		};
		// create new instance: this one should fall into the virginica domain
		DenseInstance newInstanceVirginica = new DenseInstance(dataUnpredicted.numAttributes()) {
			{
				setValue(attributeSepalLength, 7);
				setValue(attributeSepalWidth, 3);
				setValue(attributePetalLength, 6.8);
				setValue(attributePetalWidth, 2.1);
			}
		};
		
		// instance to use in prediction
		DenseInstance newInstance = newInstanceSetosa;
		// reference to dataset
		newInstance.setDataset(dataUnpredicted);

		// import ready trained model
		Classifier cls = null;
		try {
			cls = (Classifier) weka.core.SerializationHelper
					.read("/home/pirius/iris_model_logistic_allfeatures.model");
		} catch (Exception e) {
			e.printStackTrace();
		}
		if (cls == null)
			return;

		// predict new sample
		try {
			double result = cls.classifyInstance(newInstance);
			System.out.println("Index of predicted class label: " + result + ", which corresponds to class: " + classes.get(new Double(result).intValue()));
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
Categories: Data Analysis

SVM classification example with performance measures using R caret

The caret package (short for Classification And REgression Training)

This example is a follow-up to hyperparameter tuning using the e1071 package in R. This time we’re using the SVM implementation from the R caret package, a binary classification problem, and some extended features that come in handy for many classification problems. For an easy start with caret take a look at one of the many presentations and intros to caret (like this one by Max Kuhn, the maintainer of caret).

Desirable features

  • Hyperparameter tuning using grid search
  • Multicore processing for speedup
  • Weighting of samples to compensate for unequal class sizes (not possible with all classifiers, but possible with SVM)
  • A classifier that outputs not only predicted classes but also prediction probabilities; these can be used for extended performance measures (e.g. ROC curves)

Dataset

We use 2 out of the 3 classes from the standard R iris dataset (the versicolor and virginica classes). We further restrict ourselves to using only the first two features (Sepal.Length and Sepal.Width) to make well-performing classification a bit more difficult. If we used all features, classification would be close to perfect, and the performance measures would be rather boring.

> head(iris[order(iris$Species, decreasing = T),])

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica

Concepts

We first split our data into a training and a test partition. The training data will be used for cross validation (CV) hyperparameter tuning and for determining + building a model from the best hyperparameters. The test data will be used exclusively for performance measurement. Second, we create weights for our data samples so that each sample is weighted according to its corresponding class size (this compensates for smaller and bigger class sizes). Samples of bigger classes will be assigned smaller weights and vice versa. As the classes in our problem are exactly equal in size, the weights will be equal for all samples too.

Next we search for the best hyperparameters and create the corresponding model using caret train. For preprocessing we let train auto scale and center our training data. There exist many other well-known and well-working approaches for preprocessing, some of them implemented in caret (e.g. using PCA or ICA) – but we leave them out for the moment. For the parameter grid to be searched we use an exponential parameter series. This way the level of detail gets adjusted along with the parameter magnitude (be careful about where you want finer or coarser resolution), and huge parameter spaces can be covered (with coarse detail) with the first run already. For cross validation we use 10 partitions (caret’s internal performance measures are computed as the average performance over partitions). We allow parallel processing – and request train to create the classification model so that it computes prediction probabilities along with the predicted classes.

After having found our model parameters and created our model, we let it predict the classes of our test data. We do that a) predicting classes to obtain a confusion matrix and b) predicting probabilities to create a ROC curve. For the ROC curve we have to define one of our classes as the positive class (we choose versicolor here) and the other (virginica) as the negative class. Finally, we extract the ROC curve points of equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER). Together they give you an idea of how well your model performs on new data.

Script code

# Example of SVM hyperparameter tuning and performance measures for a binary class problem with the caret package
# Rainhard Findling
# 2014 / 07
# 
library(caret)
library(plyr)
library(ROCR)

# optional - enable multicore processing
library(doMC)
registerDoMC(cores=4)

data(iris)
# create binary problem: only use two classes
x <- iris[,1:2][iris$Species %in% c('versicolor', 'virginica'),]
y <- factor(iris[,5][iris$Species %in% c('versicolor', 'virginica')])

# we see that the data will probably not be perfectly separable using linear separation
featurePlot(x = x, y = y)

# split into train and test data. train = cv parameter grid search and model creation, test = performance analysis
set.seed(123456)
indexes_y_test <- createDataPartition(y = 1:length(y), times = 1, p = 0.3)[[1]]

# creation of weights - also fast for very big datasets
weights <- as.numeric(y[-indexes_y_test])
for(val in unique(weights)) {weights[weights==val]=1/sum(weights==val)*length(weights)/2} # normalized to sum to length(samples)

model <- train(method = 'svmLinear', 
           x = x[-indexes_y_test,], 
           y = y[-indexes_y_test], 
           weights = weights,
           maximize = T,
           tuneGrid = expand.grid(.C=3^(-15:15)),   
           preProcess = c('center', 'scale'),
           trControl = trainControl(method = 'cv', # cross validation
                                    number = 10,   # nr of cv sets
#                                     repeats = 5, # use with method=repeatcv
                                    returnResamp = 'none', # don't save resampled performance measures per cv partition
                                    classProbs = T, # return prediction probabilities along with predicted classes
#                                     savePredictions=T, # returns all predictions (for all cv paritions) for each tuneGrid parameter set 
                                    returnData = F, # disable return of training data e.g. for big data sets
                                    allowParallel = T
            )
)
# we see some accuracy around 0.7-0.8
model
head(with(model, results[order(results$Kappa, decreasing=T),]))
# confusion matrix: model predicting classes of test data
table(predict.train(object = model, newdata = x[indexes_y_test,], type='raw'), y[indexes_y_test])
# prediction probabilities of test data classes
probs <- predict.train(object = model, newdata = x[indexes_y_test,], type='prob')[,1]
isPositiveClass <- y[indexes_y_test] == 'versicolor' # for a ROC curve there is a positive class (true match rate...) - defining that class here
pred <- prediction(probs, isPositiveClass)
perf <- performance(pred, 'tpr', 'fpr')
# plot: either
plot(perf, lwd=2, col=3)
# or
with(attributes(perf), plot(x=x.values[[1]], y=y.values[[1]], type='l')) 

# some metrics: AUC (area under curve), EER (equal error rate), MSER (minimum squared error rate), Cohen's Kappa etc.
AUC <- attributes(performance(pred, 'auc'))$y.value[[1]] # area under curve
df <- with(attributes(pred), data.frame(cutoffs=cutoffs[[1]], tp=tp[[1]], fn=fn[[1]], tn=tn[[1]], fp=fp[[1]], TMR=tp[[1]]/(tp[[1]]+fn[[1]]), TNMR=tn[[1]]/(tn[[1]]+fp[[1]])))
df$MSER <- with(df, sqrt((1-TMR)**2+(1-TNMR)**2))
MSER <- min(df$MSER) # minimum euclidean distance of the ROC curve to the point TMR = TNMR = 1
i_eer <- with(df, which.min(abs(TMR-TNMR)))
EER <- with(df[i_eer,], mean(c(TMR,TNMR))) # equal error rate: mean would not be required when using ROCR as it's always exact
df$acc <- with(df, (tp + tn) / length(isPositiveClass)) # observed accuracy
df$acc_expected <- with(df, sum(isPositiveClass) * (tp+fp) / length(isPositiveClass) + sum(!isPositiveClass) * (tn+fn) / length(isPositiveClass)) / length(isPositiveClass) # expected accuracy
df$kappa <- with(df, (acc - acc_expected) / (1 - acc_expected)) # cohen's kappa
# graphical representation 
matplot(df[,6:11], lty=1:6, lwd=2, type='l'); legend('bottomright', legend=names(df[,6:11]), lty=1:6, lwd=2, col=1:6)

# characteristics for settings with best EER, MSER etc.
df[i_eer,] # curve point for equal error rate
df[order(df$acc, decreasing=T)[[1]],] # curve point for max accuracy
df[order(df$kappa, decreasing=T)[[1]],] # curve point for max kappa
df[order(df$MSER, decreasing=F)[[1]],] # curve point for minimum squared error rate

Results

In a test run, the confusion matrix for the test data set was

             versicolor virginica
  versicolor         14         6
  virginica           2        10

which points out that 2 out of 16 versicolor samples were incorrectly classified as virginica and 6 out of 16 virginica samples were incorrectly classified as versicolor. Based on the prediction probabilities we generated a ROC curve. The true match rate (TMR) indicates the fraction of test data set samples that were correctly identified as being of the positive class (in our case: versicolor). The false match rate (FMR, counterpart to the true non match rate TNMR) indicates the fraction of test data set samples that were incorrectly identified as being of the positive class.

[ROC curve plot]

Although ROC curves indicate classification performance well, it is hard to compare classifiers directly based on ROC curves. Here, different other metrics come in handy, such as the area under curve (AUC), equal error rate (EER), maximum accuracy, maximum kappa and minimum squared error rate (MSER = the minimum euclidean distance between the ROC curve and the top left corner, the point TMR = TNMR = 1). In our case the AUC is about 0.8672; the other metrics (and their corresponding ROC curve points) are about:

             cutoffs   TMR TNMR accuracy accuracy_expected kappa  MSER
EER            0.550 0.750 0.75     0.75               0.5   0.5 0.354
Max acc.       0.515 0.875 0.75   0.8125               0.5 0.625 0.280
Max kappa      0.515 0.875 0.75   0.8125               0.5 0.625 0.280
Min MSER       0.515 0.875 0.75   0.8125               0.5 0.625 0.280

mctune: multicore hyperparameter tuning in R on the example of SVM car detection

April 20, 2014

In Machine Learning (ML) tasks, finding good hyperparameters for machine learning models is critical (hyperparameter optimization). In R there exist some packages containing routines that do that for you using grid search (constructing and testing all possible parameter combinations of a grid), e.g. the tune function in David Meyer’s e1071 package.

Besides the very good routines already contained in those packages, a while ago I wanted to have multicore hyperparameter tuning and needed the true positive rate (true match rate) and true negative rate (true non match rate) alongside the overall error rate. Therefore I modified the tune method of the e1071 package to a) use multicore processing and b) return confusion matrices showing average true positive / true negative rates per sample class and parameter setting.

The script inherits its GPL-2 licence from its original version. I modified tune specifically for my own needs when tuning SVM hyperparameters, and therefore did not validate its correctness or how it works with other tune-able methods (e.g. KNN, random forest).
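
For comparison, a plain single-core grid search with the unmodified tune from e1071 could look roughly like the following sketch (the dataset and parameter ranges are purely illustrative):

library(e1071)
data(iris)
# standard grid search over gamma and cost with 10-fold cross validation
t_single <- tune(svm, Species ~ ., data = iris,
                 ranges = list(gamma = 3^(-8:0), cost = 3^(-4:8)),
                 tunecontrol = tune.control(sampling = 'cross', cross = 10))
t_single$best.parameters # best parameter combination found
t_single$performances    # error and dispersion per parameter combination

mctune below keeps this interface and adds multicore processing (via the mc.control argument) and optional per-class confusion matrices (via confusionmatrizes=T):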

# modified version of "tune" from package "e1071"
# licence: GPL-2
# details: http://cran.r-project.org/web/packages/e1071/index.html
mctune <- function(method, train.x, train.y = NULL, data = list(),
                 validation.x = NULL, validation.y = NULL,
                 ranges = NULL, predict.func = predict,
                 tunecontrol = tune.control(),
                 mc.control=NULL,
                 confusionmatrizes=F,
                 ...
) {
  call <- match.call()

  require('plyr')
  require('parallel')

  ## internal helper functions
  resp <- function(formula, data) {
    model.response(model.frame(formula, data))
  }

  classAgreement2 <- function (tab) {
    n <- sum(tab)
    # correct classification rate
    if (!is.null(dimnames(tab))) {
      lev <- intersect(colnames(tab), rownames(tab))
      d <- diag(tab[lev, lev])
      p0 <- sum(d) / n
    } else {
      m <- min(dim(tab))
      d <- diag(tab[1:m, 1:m])
      p0 <- sum(d) / n
    }
    # confusion matrizes
    if(!confusionmatrizes) {
      list(p0=p0)
    } else if(is.null(dimnames(tab))) {
      stop('tables without dimension names are not allowed when generating confusionmatrizes.')
    }
    else {
      # generate confusion matrix for each class
      classnames <- unique(unlist(dimnames(tab)))
      truepositives <-  unlist(lapply(classnames, function(positiveclassname) { sum(d[positiveclassname])}))
      falsepositives <- unlist(lapply(classnames, function(positiveclassname) { sum(tab[positiveclassname,])-tab[positiveclassname,positiveclassname]}))
      falsenegatives <- unlist(lapply(classnames, function(positiveclassname) { sum(tab[,positiveclassname])-tab[positiveclassname,positiveclassname]}))
      truenegatives <- mapply(FUN=function(tp,fn,fp){ sum(tab)-tp-fn-fp }, truepositives, falsenegatives, falsepositives)
      confusions <- data.frame(classnames, truepositives, truenegatives, falsepositives, falsenegatives, row.names=NULL)
      colnames(confusions) <- c('class', 'tp', 'tn', 'fp', 'fn')
      list(p0=p0, confusions=confusions)
    }
  }

  ## parameter handling
  if (tunecontrol$sampling == "cross")
    validation.x <- validation.y <- NULL
  useFormula <- is.null(train.y)
  if (useFormula && (is.null(data) || length(data) == 0))
    data <- model.frame(train.x)
  if (is.vector(train.x)) train.x <- t(t(train.x))
  if (is.data.frame(train.y))
    train.y <- as.matrix(train.y)

  ## prepare training indices
  if (!is.null(validation.x)) tunecontrol$fix <- 1
  n <- (nrow(if (useFormula) data
            else train.x))
  perm.ind <- sample(n)
  if (tunecontrol$sampling == "cross") {
    if (tunecontrol$cross > n)
      stop(sQuote("cross"), " must not exceed sampling size!")
    if (tunecontrol$cross == 1)
      stop(sQuote("cross"), " must be greater than 1!")
  }
  train.ind <- (if (tunecontrol$sampling == "cross")
                  tapply(1:n, cut(1:n, breaks = tunecontrol$cross), function(x) perm.ind[-x])
                else if (tunecontrol$sampling == "fix")
                  list(perm.ind[1:trunc(n * tunecontrol$fix)])
                else
                ## bootstrap
                  lapply(1:tunecontrol$nboot,
                         function(x) sample(n, n * tunecontrol$boot.size, replace = TRUE))
  )

  ## find best model
  parameters <- (if(is.null(ranges))
                  data.frame(dummyparameter = 0)
                else
                  expand.grid(ranges))
  p <- nrow(parameters)
  if (!is.logical(tunecontrol$random)) {
    if (tunecontrol$random < 1)
      stop("random must be a strictly positive integer")
    if (tunecontrol$random > p) tunecontrol$random <- p
    parameters <- parameters[sample(1:p, tunecontrol$random),]
  }

  ## - loop over all models
  # concatenate arbitrary mc-arguments with explicit X and FUN arguments
  train_results<-do.call(what="mclapply", args=c(mc.control, list(X=1:p, ..., FUN=function(para.set) {
    sampling.errors <- c()
    sampling.confusions <- c()

    ## - loop over all training samples
    for (sample in 1:length(train.ind)) {
      repeat.errors <- c()
      repeat.confusions <- c()

      ## - repeat training `nrepeat' times
      for (reps in 1:tunecontrol$nrepeat) {

        ## train one model
        pars <- if (is.null(ranges))
          NULL
        else
          lapply(parameters[para.set,,drop = FALSE], unlist)

        model <- if (useFormula)
          do.call(method, c(list(train.x,
                                 data = data,
                                 subset = train.ind[[sample]]),
                            pars, list(...)
          )
          )
        else
          do.call(method, c(list(train.x[train.ind[[sample]],],
                                 y = train.y[train.ind[[sample]]]),
                            pars, list(...)
          )
          )

        ## predict validation set
        pred <- predict.func(model,
                             if (!is.null(validation.x))
                               validation.x
                             else if (useFormula)
                               data[-train.ind[[sample]],,drop = FALSE]
                             else if (inherits(train.x, "matrix.csr"))
                               train.x[-train.ind[[sample]],]
                             else
                               train.x[-train.ind[[sample]],,drop = FALSE]
        )

        ## compute performance measure
        true.y <- if (!is.null(validation.y))
          validation.y
        else if (useFormula) {
          if (!is.null(validation.x))
            resp(train.x, validation.x)
          else
            resp(train.x, data[-train.ind[[sample]],])
        } else
          train.y[-train.ind[[sample]]]

        if (is.null(true.y)) true.y <- rep(TRUE, length(pred))

        if (!is.null(tunecontrol$error.fun))
          repeat.errors[reps] <- tunecontrol$error.fun(true.y, pred)
        else if ((is.logical(true.y) || is.factor(true.y)) && (is.logical(pred) || is.factor(pred) || is.character(pred))) { ## classification error
          l <- classAgreement2(table(pred, true.y))
          repeat.errors[reps] <- (1 - l$p0) # wrong classification rate
          if(confusionmatrizes) {
            repeat.confusions[[reps]] <- l$confusions
          }
        } else if (is.numeric(true.y) && is.numeric(pred)) ## mean squared error
          repeat.errors[reps] <- crossprod(pred - true.y) / length(pred)
        else
          stop("Dependent variable has wrong type!")
      }
      sampling.errors[sample] <- tunecontrol$repeat.aggregate(repeat.errors)
      # TODO potentially implement separate aggregation of tp tn fp fn values. currently those are taken from the repetition that corresponds to the smallest error.
      if(confusionmatrizes) {
        sampling.confusions[[sample]] <- repeat.confusions[repeat.errors == sampling.errors[sample]][[1]]
      }
    }
    # TODO potentially implement separate aggregation of tp tn fp fn values. currently uses the same as for error / variance aggregation
    if(!confusionmatrizes) {
      list(
        model.error=tunecontrol$sampling.aggregate(sampling.errors),
        model.variance=tunecontrol$sampling.dispersion(sampling.errors))
    } else {
      # create one confusion data frame
      confusions <- ldply(sampling.confusions, data.frame)
      # calculate aggregate / disperse values per class
      confusions <- ldply(lapply(X=split(confusions, confusions$class), FUN=function(classdf) {
        class=unique(classdf$class)
        # only take numeric values
        classdf[,c('tp','tn','fp','fn')]
        # calculate aggregate / disperse values for this class
        aggregated <- apply(X=classdf[,c('tp','tn','fp','fn')], MAR=2, FUN=tunecontrol$sampling.aggregate)
        dispersions <- apply(X=classdf[,c('tp','tn','fp','fn')], MAR=2, FUN=tunecontrol$sampling.dispersion)
        # make 1 row dataframe out of it (combine rows later with outer ldply)
        t(data.frame(c(value=aggregated,dispersion=dispersions)))
      }), data.frame)
      colnames(confusions) <- c('class', 'tp.value', 'tn.value', 'fp.value', 'fn.value', 'tp.dispersion', 'tn.dispersion', 'fp.dispersion', 'fn.dispersion')
      # calculate mean confusion matrix values (mean of all classes) for best model
      confusions.mean <- data.frame(t(apply(X=confusions[,c('tp.value','tn.value','fp.value','fn.value','tp.dispersion','tn.dispersion','fp.dispersion','fn.dispersion')], MAR=2, FUN=mean)))
      colnames(confusions.mean) <- c('tp.value', 'tn.value', 'fp.value', 'fn.value', 'tp.dispersion', 'tn.dispersion', 'fp.dispersion', 'fn.dispersion')
      list(
        model.error=tunecontrol$sampling.aggregate(sampling.errors),
        model.variance=tunecontrol$sampling.dispersion(sampling.errors),
        model.confusions=confusions,
        model.confusions.mean=confusions.mean
      )
    }
  })))
#   print('mctune: mclapply done.')
#   print(train_results)
  model.errors <- unlist(lapply(train_results,function(x)x$model.error))
  model.variances <- unlist(lapply(train_results,function(x)x$model.variance))
  if(confusionmatrizes){
    model.confusions <- lapply(train_results,function(x)x$model.confusions)
    model.confusions.mean <- ldply(lapply(train_results,function(x)x$model.confusions.mean), data.frame)
  }

  ## return results
  best <- which.min(model.errors)
  pars <- if (is.null(ranges))
    NULL
  else
    lapply(parameters[best,,drop = FALSE], unlist)
  structure(list(best.parameters  = parameters[best,,drop = FALSE],
                 best.performance = model.errors[best],
                 method           = if (!is.character(method))
                   deparse(substitute(method)) else method,
                 nparcomb         = nrow(parameters),
                 train.ind        = train.ind,
                 sampling         = switch(tunecontrol$sampling,
                                           fix = "fixed training/validation set",
                                           bootstrap = "bootstrapping",
                                           cross = if (tunecontrol$cross == n) "leave-one-out" else
                                             paste(tunecontrol$cross,"-fold cross validation", sep="")
                 ),
                 performances     = if (tunecontrol$performances) cbind(parameters, error = model.errors, dispersion = model.variances),
                 confusionmatrizes = if (confusionmatrizes) model.confusions,
                 confusionmatrizes.mean = if(confusionmatrizes) model.confusions.mean,
                 best.confusionmatrizes = if(confusionmatrizes) model.confusions[[best]],
                 best.confusionmatrizes.mean = if(confusionmatrizes) model.confusions.mean[best,],
                 best.model       = if (tunecontrol$best.model) {
                   modeltmp <- if (useFormula)
                     do.call(method, c(list(train.x, data = data),
                                       pars, list(...)))
                   else
                     do.call(method, c(list(x = train.x,
                                            y = train.y),
                                       pars, list(...)))
                   call[[1]] <- as.symbol("best.tune")
                   modeltmp$call <- call
                   modeltmp
                 }
  ),
            class = "tune"
  )
}

Additional parameters

mc.control

A list of parameters that go to mclapply from the parallel package. Example: mc.control=list(mc.cores=3, mc.preschedule=F)

confusionmatrizes

Binary flag indicating whether confusion matrices should be generated or not.

Additional return values

tune$confusionmatrizes

List of confusion matrices. The matrices are sorted the same way as tune$performances is. Each matrix is a data frame listing the sample classes (class) and their corresponding absolute average values (value) and standard deviations (dispersion) for true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). A precomputed example of such a confusion matrix from the car detection example below could look like this (where “yes” means “car” and “no” means “no car”):

class tp.value tn.value fp.value fn.value tp.dispersion tn.dispersion fp.dispersion fn.dispersion
   no     49.5     53.4      1.6      0.5      5.212165      5.621388     0.8432740     0.5270463
  yes     53.4     49.5      0.5      1.6      5.621388      5.212165     0.5270463     0.8432740

In the above example, for the “yes” class, tp.value=53.4 means that on average 53.4 samples were correctly classified as “yes” using the corresponding parameter setting.

Car detection

As an ML example we do car detection. The task is to decide whether a given image contains a car or not (therefore leaving out search mechanisms like sliding windows). As data source we use the UIUC Image Database for Car Detection. In order to reduce the number of features (pixels) and to “blur” the images (which discards unnecessary detail and likely even improves recognition), I resized the images from the original 100×40 to 50×20 pixels. For a human it is still easy to decide whether an image shows a car at this scale:

[Example images at 50×20 pixels: No car 1–3 and Car 1–3]

As the ML model we use a support vector machine with a radial (Gaussian) kernel, which leaves us with the cost and gamma parameters to be tuned – which we do using grid search and 10-fold cross validation.
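
The downscaling from 100×40 to 50×20 pixels mentioned above can be done in many ways; here is a minimal sketch in R, assuming grayscale PNGs read with the png package and simple nearest-neighbour subsampling (the file names are placeholders):

library(png)
img <- readPNG('pos-0.png')                 # 40 x 100 grayscale matrix
img_small <- img[seq(1, nrow(img), by = 2), # keep every 2nd row and column
                 seq(1, ncol(img), by = 2)] # -> 20 x 50 matrix
writePNG(img_small, 'pos-0_small.png')

The actual detection script then loads the (already downscaled) images and runs the grid search: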

# load data
print('loading data...')
library('png')
car_neg <- lapply(dir(pattern="neg.*png"),readPNG)
car_pos <- lapply(dir(pattern="pos.*png"),readPNG)
data <- c(car_neg,car_pos)
labels <- factor(c(rep('no',length(car_neg)), rep('yes', length(car_pos))))
# put data to dataframe
data <- t(array(unlist(data),dim=c(length(data[[1]]),length(data))))
data<-data.frame(data)
# look at a car
# image(array(as.numeric(data[600,]),dim=c(20,50)))
# look at correlation between first pixels
# plot(data.frame(data,labels)[,1:8],pch='.') 

library('e1071')
source('mctune.R')
t<-mctune(confusionmatrizes=T,
          mc.control=list(mc.cores=3, mc.preschedule=F),
          method=svm,
          ranges=list(type='C',
                      kernel='radial',
                      gamma=3^(-10:-4),
                      cost=3^(-8:8)),
          train.x=data,
          train.y=labels,
          validation.x=NULL, #validation.x and .y are only used with tune.control sampling='boot' and 'fixed'.
          validation.y=NULL,
          tunecontrol=tune.control(sampling='cross',
                                   cross=10,
                                   performances=T,
                                   nrepeat=1,
                                   best.model=T))
# extract FMR FNMR from our positive class
p <- lapply(X=t$confusionmatrizes, FUN=function(x){
  p <- x[x$class=='yes',]
  p[-1]
})
p <- ldply(p)
t$performances <- data.frame(t$performances, p)
t$performances$FMR <- t$performances$fp.value / (t$performances$tn.value + t$performances$fp.value)
t$performances$FNMR <- t$performances$fn.value / (t$performances$tp.value + t$performances$fn.value)
# print list of errors
t$performances[with(t$performances, order(error)),]

# different plots of parameters and errors
library('scatterplot3d')
scatterplot3d(t$performances$cost, t$performances$gamma, t$performances$error,log='xy', type='h', pch=1, color='red')
plot(t$performances[,3:5],log='xy',pch=4)
# paramters and errors: from best to worst
plot(t$performances[with(t$performances, order(error)),]$cost,log='y', col='blue')
points(t$performances[with(t$performances, order(error)),]$gamma, col='green')
points(t$performances[with(t$performances, order(error)),]$error, col='black')
# points(t$performances[with(t$performances, order(error)),]$FMR, col='red')
# points(t$performances[with(t$performances, order(error)),]$FNMR, col='orange')

Using this setup, the first search gives an error rate of about 0.02 for gamma around 10^-3 and cost >= 3 (this indicates the search area for the next step). The corresponding false match rate (FMR) and false non match rate (FNMR) are in the range of 0.01 to 0.029.
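
A second, finer search iteration could then restrict the grid to that region, for example like this (the ranges are illustrative only, everything else stays as in the call above):

t2 <- mctune(confusionmatrizes=T,
             mc.control=list(mc.cores=3, mc.preschedule=F),
             method=svm,
             ranges=list(type='C',
                         kernel='radial',
                         gamma=3^seq(-8, -5, by=0.5), # finer steps around gamma ~ 10^-3
                         cost=3^(1:8)),               # cost >= 3
             train.x=data,
             train.y=labels,
             tunecontrol=tune.control(sampling='cross', cross=10, performances=T, nrepeat=1, best.model=T))
t2$performances[with(t2$performances, order(error)),]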

Image classification using SVMs in R

February 24, 2013

Recently I did some Support Vector Machine (SVM) tests in R (a statistical language with functional parts for rapid prototyping and data analysis — somewhat similar to Matlab, but open source ;)) for my current face recognition projects. To get my SVMs up and running in R, using image data as input and output, I wrote a small demo script for classifying images. As test data I used 2 classes of images (lines from top left to bottom right and lines from bottom left to top right), with 10 samples each — like these:

[Six example images: three of each class]
The complete image set is available here.

For SVM classification simple train and test sets are used — for more sophisticated problems, n-fold cross validation for finding good parameter settings is recommended instead. For everybody who has not yet worked with SVMs, I’d recommend reading something about how to start with “good” SVM classification, like the pretty short and easy to read “A Practical Guide to Support Vector Classification” by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (the LIBSVM inventors).
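
If you want to do such a cross validated parameter search with e1071, a minimal sketch could look like the following (it uses the train_in/train_out variables prepared in the script below; the cost range and number of folds are illustrative):

# n-fold cross validated grid search over cost (add gamma for non-linear kernels)
tuned <- tune.svm(train_in, factor(train_out), type='C', kernel='linear',
                  cost=10^(-3:3),
                  tunecontrol=tune.control(sampling='cross', cross=5))
tuned$best.parameters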

Update: added parallel processing using parallel and mclapply for loading the image data (for demonstration purposes only, as loading 10 images in parallel does not make a big difference ;)).

print('starting svm demo...')

library('png')
library('e1071')
library('parallel')

# load img data
folder<-'.'
file_list <- dir(folder, pattern="png")
data <- mclapply(file_list, readPNG, mc.cores=2)
# extract subject id + img nr from names
subject_ids <- lapply(file_list, function(file_name) as.numeric(unlist(strsplit(file_name, "_"))[1]))
# rename subject id's to c1 and c2 for more clear displaying of results
subject_ids[subject_ids==0]='c1'
subject_ids[subject_ids!='c1']='c2'
img_ids <- lapply(file_list, function(file_name) as.numeric(unlist(strsplit(unlist(strsplit(file_name, "_"))[2], "\\."))[1]))

# specify which data should be used as test and train by the img nrs
train_test_border <- 7
# split data into train and test, and bring into array form to feed to svm
train_in <- t(array(unlist(data[img_ids < train_test_border]), dim=c(length(unlist(data[1])),sum(img_ids < train_test_border))))
train_out <- unlist(subject_ids[img_ids < train_test_border])
test_in <- t(array(unlist(data[img_ids >= train_test_border]), dim=c(length(unlist(data[1])),sum(img_ids >= train_test_border))))
test_out <- unlist(subject_ids[img_ids >= train_test_border])

# train svm - try out different kernels + settings here
svm_model <- svm(train_in, train_out, type='C', kernel='linear')

# evaluate svm
p <- predict(svm_model, train_in)
print(p)
print(table(p, train_out))
p <- predict(svm_model, test_in)
print(p)
print(table(p, test_out))

print('svm demo done!')