SVMTorch is a new implementation of Vapnik's Support Vector Machine
that works for both classification and regression problems, and that has
been specifically tailored for large-scale problems (more than 20000
examples, even with input dimensions higher than 100).
The source code is free for academic use. It must not be modified or
distributed without prior permission of the author. When using SVMTorch in
your scientific work, please cite the following article:
Ronan Collobert and Samy Bengio, Support Vector Machines for Large-Scale Regression Problems, IDIAP-RR-00-17, 2000.
The software has been successfully compiled on Sun/SOLARIS, Intel/LINUX and Alpha/OSF operating systems. You can download it from ftp.idiap.ch/pub/learning/SVMTorch.tgz.
First, you should download the source code
from ftp.idiap.ch/pub/learning/SVMTorch.tgz
and the examples from ftp.idiap.ch/pub/learning/TrainData.tgz.
Put these two archive files in the same directory and decompress them with:
zcat SVMTorch.tgz | tar xf -
zcat TrainData.tgz | tar xf -
This creates two new directories: "SVMTorch" and "TrainData".
Now, go into the "SVMTorch" directory and edit the Makefile. You should only have to change the following lines, depending on your specific platform:
# C-compiler
#CC=gcc
CC=cc
# C-Compiler flags
#CFLAGS=-Wall -W -O9 -funroll-all-loops -finline -fomit-frame-pointer -ffast-math
CFLAGS=-native -fast -xO5
# linker
#LD=gcc
LD=cc
# linker flags
#LFLAGS=-Wall -W -O9 -funroll-all-loops -finline -fomit-frame-pointer -ffast-math
LFLAGS=-native -fast -xO5
# libraries
LIBS=-lm
The default configuration is set for a machine using the Sun Workshop compiler. An alternate (commented) configuration is provided for the GNU gcc compiler.
Type "make all" and pray.
It should compile without any warning.
On some platforms, you may have to change the include files needed for "times", a non-standard function used by svm_torch. In that case, edit the file "general.h" and change the lines
#ifdef I_WANT_TIME
#include <sys/times.h>
/*#include <limits.h>*/
#include <time.h>
#endif
If it doesn't work, or if you don't want to measure the time taken by
the learning machine, just comment out the line:
#define I_WANT_TIME
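If you keep the timing enabled, nothing else is needed: "times" is the classic POSIX call for measuring CPU time. As a minimal, self-contained sketch of how such timing is typically done (illustrative C only, not code taken from svm_torch), the ticks returned by times() are converted to seconds with the clock-tick rate:

#include <stdio.h>
#include <sys/times.h>   /* times() and struct tms */
#include <unistd.h>      /* sysconf(_SC_CLK_TCK) */

int main(void)
{
  struct tms start, end;
  /* older systems read the tick rate from CLK_TCK in <time.h> or <limits.h>,
     which is why those includes appear in "general.h" */
  long ticks_per_sec = sysconf(_SC_CLK_TCK);
  double x = 0.0;
  long i;

  times(&start);
  for (i = 0; i < 10000000; i++)   /* dummy work standing in for the learning phase */
    x += (double)i * 1e-7;
  times(&end);

  printf("user time: %.2f s (x = %g)\n",
         (double)(end.tms_utime - start.tms_utime) / ticks_per_sec, x);
  return 0;
}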
Note that in "general.h" you can comment the line
#define USEDOUBLE
in order to do the computations in float. IT'S A BAD IDEA: svm_torch needs precision.
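Such a switch usually just selects the floating-point type used throughout the code. As an illustration only (the actual type name and definition in "general.h" may differ), a compile-time precision switch of this kind typically looks like:

#ifdef USEDOUBLE
typedef double real;   /* default: double precision */
#else
typedef float real;    /* single precision: smaller and faster, but less accurate */
#endif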
If everything went well, you should now have two programs: "svm_torch" and "svm_test". The first one is the learning machine and the second one is the testing machine.
To display all the options, just run svm_torch or svm_test without any parameter.
To test the program in classification, try:
svm_torch -v -ae ../TrainData/classif_train.dat ../TrainData/model_dummy
It takes less than two minutes on a 300 MHz computer. You should get around 914 support vectors (this number may change slightly depending on the precision of your machine).
To test the SVM on the training data, try:
svm_test -ae ../TrainData/model_dummy ../TrainData/classif_train.dat
You should get around 0.78% misclassified examples.
To test the program in regression, try:
svm_torch -v -ae -rm -st 900 -eps 20 ../TrainData/regress_train.dat ../TrainData/model_dummy
You should get around 597 support vectors.
Test the model with:
svm_test -ae ../TrainData/model_dummy ../TrainData/regress_train.dat
The mean squared error should be around 187.2.
The general syntax of svm_torch and svm_test is
svm_torch [options] example_file model_file
svm_test [options] model_file test_file
Where "example_file" is your training set file, "test_file" is your testing set file and "model_file" is the SVM-model created by svm_torch.
All options are described when you launch svm_torch or svm_test
without any argument.
By default, svm_torch is a classification machine. If you want
the regression machine, use option -rm.
You should always use the -v option with svm_torch: it prints the current error during learning. This error is only an indicator; it can oscillate.
There are two main input formats for "example_file" and "test_file" in
SVMTorch: an ASCII format and a binary one.
The ASCII format is the following:
<Number n of training/testing samples> <Dimension d of each sample+1>
<a11> <a12> <a13> .... <a1d> <a1_out>
.
.
.
<an1> <an2> <an3> .... <and> <an_out>
where <aij> is an ASCII floating point number corresponding to the j-th value of the i-th example and <ai_out> is the i-th desired output (in classification, it should be +1/-1).
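For example, a small ASCII classification file with n=3 examples of dimension d=2 (the values below are made up for illustration, so the header is "3 3" since d+1=3) would look like:

3 3
0.5  1.2   1
-0.3 0.7  -1
2.1  -1.5  1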
With the same notation, the binary format is:
<Number n of training/testing samples> <Dimension d of each sample>
<a11>...<a1d> ....... <an1>...<and> <a1_out>... <an_out>
(First save the input table, then the output table, all in binary)
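As an illustration, here is a small C sketch that writes such a binary training file. The exact on-disk types are not specified above, so this sketch assumes a 4-byte int header and double-precision values, and the file name "binary_train.dat" is just an example; check the SVMTorch sources if you depend on the exact layout.

#include <stdio.h>

/* Write a binary SVMTorch-style file: header (n, d), then the n*d
   inputs, then the n outputs.  The header and value types used here
   are assumptions made for this sketch. */
int main(void)
{
  int n = 3, d = 2;
  double inputs[6]  = { 0.5, 1.2, -0.3, 0.7, 2.1, -1.5 };  /* n*d values, row by row */
  double outputs[3] = { 1.0, -1.0, 1.0 };                  /* n desired outputs */

  FILE *f = fopen("binary_train.dat", "wb");
  if (f == NULL) { perror("fopen"); return 1; }

  fwrite(&n, sizeof(int), 1, f);                      /* number of examples */
  fwrite(&d, sizeof(int), 1, f);                      /* dimension of each sample */
  fwrite(inputs,  sizeof(double), (size_t)(n * d), f); /* input table first */
  fwrite(outputs, sizeof(double), (size_t)n, f);       /* then the output table */

  fclose(f);
  return 0;
}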
There is another special input format for svm_test, for the case where you
don't have the desired outputs (to be used with the -no option).
The ASCII version of this format is:
<Number n of training/testing samples> <Dimension d of each sample>
<a11> <a12> <a13> .... <a1d>
.
.
.
<an1> <an2> <an3> .... <and>
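For example, the same three 2-dimensional examples as above, stored without their outputs for use with the -no option, would give (note the header is now "3 2", i.e. d rather than d+1):

3 2
0.5  1.2
-0.3 0.7
2.1  -1.5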
And the binary version is:
<Number n of training/testing samples> <Dimension d of each sample>
<a11>...<a1d> ....... <an1>...<and>