2007, February 5: Andrea Passerini, Claudia Andreini, Sauro Menchetti, Antonio Rosato and Paolo Frasconi,
"Predicting Zinc Binding at the Proteome Level",
BMC Bioinformatics. Volume 8, Issue 39.
2005, September: Sauro Menchetti, Fabrizio Costa, Paolo Frasconi and Massimiliano Pontil,
"Wide Coverage Natural Language Processing using Kernel Methods and Neural Networks for Structured Data",
Pattern Recognition Letters. Volume 26, Issue 12, Pages 1896-1906.
Special issue on Artificial Neural Networks in Pattern Recognition, edited by Simone Marinai and Marco Gori.
2006, April 2-5: Sauro Menchetti, Andrea Passerini, Paolo Frasconi, Claudia Andreini and Antonio Rosato,
"Improving Prediction of Zinc Binding Sites by Modelling the Linkage between Residues Close in Sequence",
In Proceedings of the Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2006), Venice, Italy, April 2-5, 2006
2005, August 7-11: Sauro Menchetti, Fabrizio Costa and Paolo Frasconi,
"Weighted Decomposition Kernels",
In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, August 7-11, 2005
2005, March 17-19: Sauro Menchetti, Fabrizio Costa and Paolo Frasconi,
"Weighted Decomposition Kernels for Protein Subcellular Localization",
In Proceedings of BITS Annual Meeting 2005 (BITS 2005), Milan, Italy
2003, September 12-13: Sauro Menchetti, Fabrizio Costa, Paolo Frasconi and Massimiliano Pontil,
"Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data",
In Proceedings of IAPR - TC3 International Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR 2003), Florence, Italy
1999: Sauro Menchetti and Andrea Fedeli,
"Biological Sequences Alignment by Hidden Markov Models",
In Proceedings of the Sixth National Congress of the Italian Association for Artificial Intelligence (AI*IA 1999), Bologna, Italy
2006, April 5-7: Fabrizio Costa, Sauro Menchetti, Alessio Ceroni, Andrea Passerini and Paolo Frasconi,
"Decomposition Kernels for Natural Language Processing",
In Learning Structured Information in Natural Language Applications Workshop, Trento, Italy, April 3, 2006,
hosted in conjunction with the 11th Conference of the European Chapter of the Association for Computational
Linguistics (EACL 2006), April 5-7, 2006
2002, December 9-14: Fabrizio Costa, Paolo Frasconi, Sauro Menchetti and Massimiliano Pontil,
"Comparing Convolutional Kernels and Recursive Neural Networks on a Wide-Coverage Computational Analysis of Natural Language",
In NIPS 2002 Workshop on Unreal Data: Principles of Modeling Nonvectorial Data (Invited), Advances in Neural Information Processing Systems 15 (NIPS 2002), December 9-14, 2002, Vancouver, British Columbia, Canada
Kernel Methods: SVMs and Voted Perceptron
The success of machine learning methods
depends on their ability to solve pattern recognition and
regression problems. The kernel methodology provides a powerful
and unified framework for many disciplines, from neural networks
to statistical pattern recognition, to machine learning and data
mining. Any kernel method solution comprises two parts:
- a module that performs the mapping into the feature space;
- a learning algorithm designed to discover linear patterns in that space.
There are two main reasons why this approach should work. First of all,
detecting linear relations has been the focus of much research in
statistics and machine learning for decades and the resulting
algorithms are both well understood and efficient. Secondly, we
will see that a kernel function represents a computational
shortcut which makes it possible to represent linear patterns
efficiently in high-dimensional spaces while ensuring adequate
generalization.
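The computational shortcut mentioned above can be made concrete with a toy sketch (names here are illustrative, not from any particular library): for the degree-2 polynomial kernel k(x, z) = (x . z)^2, the kernel value computed entirely in the input space equals the inner product of explicit feature maps in a higher-dimensional space.

```python
import numpy as np

# Toy sketch of the kernel shortcut: the degree-2 polynomial kernel
# k(x, z) = (x . z)^2 on 2-d inputs equals the inner product of the
# explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2) in 3-d.

def phi(x):
    """Explicit feature map into the 3-d feature space."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(x, z):
    """Kernel shortcut: never builds the feature space explicitly."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # inner product in feature space
shortcut = k_poly2(x, z)            # same value, computed in input space
assert np.isclose(explicit, shortcut)
```

For higher degrees and dimensions the explicit feature space grows combinatorially while the shortcut stays a single inner product plus a power, which is the point of the trick.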
Support Vector Machines (SVMs), originally developed by Vapnik
and co-workers, are nowadays the most widespread kernel-based
machine learning algorithms. Immediately after their
introduction, an increasing number of researchers have worked on
both the algorithmic and theoretical analysis of these systems,
creating in just a few years what is effectively a new research
direction, merging concepts from distant disciplines such as
statistics, functional analysis, optimization and machine learning.
SVMs represent a very specific class of algorithms, characterized
by the use of kernels, the absence of local minima, the sparseness
of the solution and the capacity control obtained by acting on the
margin or on other dimension-independent quantities such as the number
of support vectors. They provide a new approach to the problem of
pattern recognition based on statistical learning theory.
Moreover, the sparse dual representation of the hypothesis
results in an efficient algorithm which at the same time enforces
the learning biases suggested by the generalization theory.
SVMs differ radically from comparable approaches such as neural
networks: they always find a global optimum because they are
formulated as a Quadratic Programming (QP) optimization problem
with box constraints. Their simple geometric interpretation
provides fertile ground for explaining how they work in a very
intuitive way.
SVMs are largely characterized by the choice of kernel which maps
the inputs into a feature space in which they are separated by a
linear hypothesis. Since the data only appear through inner
products, a Mercer kernel can be plugged into the algorithm,
making it possible to separate the data with a non-linear decision rule.
Often the feature space is very high dimensional, but the
so-called curse of dimensionality is cleverly sidestepped by
turning to statistical learning theory. The curse says,
essentially, that the difficulty of an estimation problem
increases drastically with the dimension n of the space, since one
needs exponentially many patterns in n to sample the space
properly. Statistical learning theory, however, tells us that
learning in a very high dimensional feature space can be simple if
one uses a low complexity, i.e. simple, class of decision rules.
All the variability and richness that one needs to have a powerful
function class is then introduced by the mapping phi through the
kernel function. In short: not the dimensionality but the
complexity of the function class matters. In addition, for certain
feature spaces and corresponding mapping phi, there is a
highly effective trick for computing scalar products in a high
dimensional feature space using kernel functions.
So SVMs represent a complete framework where several concepts are
combined together to form a powerful theory: dimension-independent
generalization bounds, Mercer kernels and RKHS, regularization
theory and QP optimization therefore represent the foundations of SVMs.
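Mercer's condition mentioned above can be checked numerically on any finite sample: the Gram matrix of a valid kernel must be positive semi-definite. A minimal sketch, using the Gaussian (RBF) kernel, whose implicit feature space is infinite-dimensional:

```python
import numpy as np

# Sketch of Mercer's condition on a finite sample: for any set of
# points, the Gram matrix K[i, j] = k(x_i, x_j) of a valid kernel
# must be positive semi-definite (all eigenvalues >= 0).

def rbf(x, z, gamma=0.5):
    """Gaussian (RBF) kernel on vectors x and z."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random points in R^3

# Build the 20x20 Gram matrix.
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)       # K is symmetric
assert eigvals.min() > -1e-9          # PSD up to numerical noise
```

A kernel failing this check on some sample cannot correspond to an inner product in any feature space, and plugging it into an SVM loses the convexity guarantees of the QP formulation.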
Voted Perceptron (VP) is an online mistake-driven algorithm for
binary classification based on Rosenblatt's Perceptron algorithm.
It is simpler to implement and
more efficient in terms of computational time than
SVMs, although there are no convergence guarantees. On the other
hand, as experimentally shown, the
performance obtained by VP tends towards the performance of
maximal-margin classifiers. The basic vector-based formulation
can be easily kernelized, so it can be efficiently employed in very
high dimensional spaces and can be extended to generic data
structures in a very simple way. Unlike the Perceptron algorithm, VP
stores all the prediction vectors generated after
each mistake and uses them to classify a new example. Each such vector is
weighted by the number of training examples it correctly
classifies until the next mistake occurs: good prediction
vectors that correctly classify many examples will thus have a
larger coefficient in the weighted sum.
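The weighted-vote scheme just described can be sketched in a few lines. This is a minimal illustration on a separable toy problem, in the spirit of Freund and Schapire's formulation, not production code:

```python
import numpy as np

# Minimal sketch of the Voted Perceptron: every prediction vector v
# is stored together with a survival count c (the number of examples
# it classified correctly before the next mistake), and prediction
# is a c-weighted vote over the signs of all stored vectors.

def train_voted_perceptron(X, y, epochs=10):
    v = np.zeros(X.shape[1])              # current prediction vector
    c = 1                                 # its survival count
    vectors = []                          # stored (v, c) pairs
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * np.dot(v, x) <= 0: # mistake
                vectors.append((v.copy(), c))
                v = v + label * x         # standard Perceptron update
                c = 1
            else:
                c += 1
    vectors.append((v, c))
    return vectors

def predict(vectors, x):
    # Weighted vote over the signs of all stored prediction vectors.
    s = sum(c * np.sign(np.dot(v, x)) for v, c in vectors)
    return 1 if s >= 0 else -1

# Toy 2-d data, separable by the line x1 = x2.
X = np.array([[2.0, 1.0], [3.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1, 1, -1, -1])
model = train_voted_perceptron(X, y)
assert all(predict(model, x) == t for x, t in zip(X, y))
```

Since examples enter only through inner products np.dot(v, x), replacing the dot product with a kernel evaluation (and storing mistakes instead of explicit vectors) gives the kernelized variant mentioned above.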
Kernels on Structured Data
Since much real-world data is structured and consequently has no
natural representation in a vector space, statistical learning in
structured and relational domains is rapidly becoming one of the
central areas of machine learning.
By structured data we mean data that is
formed by combining simpler components into more complex items,
frequently involving a recursive use of simpler objects of the
same type. Typically it will be easy to compare the simpler
components either with base kernels or using an inductive argument
over the structure of the objects. Examples of structured data
include vectors, strings and sequences but also more complex
objects such as trees, images and graphs. The design of kernels
for structured data enables the same algorithms and analysis of
structural pattern recognition to be applied to new application
domains. The extension of kernels to structured data provides the
way for analysing very different sets of data-types, opening up
possibilities such as discovering clusters in a set of trees,
learning classifications of graphs and so on. It is therefore by
designing specific kernels for structured data that kernel methods
can demonstrate their full flexibility.
Probably the most important data type after vectors is that of
symbol strings of varying lengths. This type of data is
commonplace in bioinformatics applications, where it can be used
to represent proteins as sequences of amino acids, genomic DNA as
sequences of nucleotides, promoters and other structures. Partly
for this reason, a great deal of research has been devoted to it
in the last few years. Many other application domains consider
data in the form of sequences so that many of the techniques have
a history of development within computer science, as for example
the study of string algorithms. Kernels have been developed to
compute the inner product between images of strings in
high-dimensional feature spaces using dynamic programming
techniques. The set of kernels on strings, and more generally on
structured objects, makes it possible for kernel methods to
operate in a domain that traditionally has belonged to syntactical
pattern recognition, in this way providing a bridge between that
field and statistical pattern analysis.
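As one concrete illustration of a string kernel, here is a minimal sketch of the k-spectrum kernel (function names are illustrative): the feature space is indexed by all length-k substrings, and the kernel is the inner product of the sparse k-mer count vectors.

```python
from collections import Counter

# Sketch of the k-spectrum string kernel: phi maps a string to the
# vector of counts of all its length-k substrings (k-mers), and the
# kernel is the inner product of two such sparse count vectors.

def kmer_counts(s, k):
    """Count all length-k substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Sum over shared k-mers only: a sparse inner product.
    return sum(cs[m] * ct[m] for m in cs if m in ct)

# Two short protein-like strings over the amino-acid alphabet.
a = "MKVLAAGKVLA"
b = "KVLAGG"
assert spectrum_kernel(a, b, k=3) == 4   # shared 3-mers: KVL, VLA
```

The sparse representation keeps the cost linear in the string lengths even though the implicit feature space has one dimension per possible k-mer; more refined string kernels (e.g. with gaps or mismatches) replace the exact-match comparison with dynamic programming.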
Ordinal Regression, Preference Model and Ranking
Work on learning theory has mostly concentrated on classification
and regression. However, there are many applications in which it
is desirable to order rather than to classify instances: these
problems arise frequently in social sciences, in information
retrieval, in econometric models and in classical statistics where
human preferences play a major role.
In a general supervised learning task, the goal is to learn a
function f which best models
the probabilistic relation between the input space X
and the target space Y minimizing a risk functional
depending on a predefined loss function. The properties of the
target set Y define different learning problems:
- if Y is a finite unordered set, we have a classification problem;
- if Y is a metric space, e.g., the set of real numbers, we have a regression problem.
Ordinal regression, partial ranking and preference model tasks do
not fit into either of the two previous classes but share properties
of both classification and regression problems:
- Y is a finite ordered set;
- Y is not a metric space, that is, the distance between two elements is not defined.
So, as in classification problems, we have to assign a
label to a new instance, but, similarly to regression problems, the
label set admits an order relation.
More precisely, in a ranking problem we have to sort a set of
competing alternatives by their importance, while in a preference
problem we are only interested in the best element: note that the
preference problem is a particular case of ranking.
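A standard way to attack such problems with the binary classifiers discussed earlier is the pairwise reduction: a ranking over alternatives becomes a classification problem on difference vectors. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

# Sketch of the pairwise reduction for ranking: every ordered pair
# of alternatives (i, j) with i preferred to j yields a training
# example x_i - x_j with label +1 (and implicitly x_j - x_i with -1).

def pairwise_examples(X, ranks):
    """X: feature vectors; ranks: lower value = more preferred."""
    diffs, labels = [], []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[i] == ranks[j]:
                continue                 # no preference expressed
            diffs.append(X[i] - X[j])
            labels.append(1 if ranks[i] < ranks[j] else -1)
    return np.array(diffs), np.array(labels)

# Three alternatives ranked 1st, 2nd, 3rd.
X = np.array([[3.0, 1.0], [2.0, 1.0], [0.0, 0.0]])
ranks = [1, 2, 3]
D, t = pairwise_examples(X, ranks)
```

Any linear classifier w learned on (D, t) then induces a ranking of new alternatives by sorting them on the score w . x, and taking the top-scoring element solves the preference problem as a special case.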
During my M.Sc. thesis, I designed an extension of the Naive Bayes
classifier for text categorization. I worked in the framework
of semi-supervised learning, where a set of both labelled
and unlabelled data is given and the task is to construct a
better classifier by exploiting the unlabelled data than would be
possible using the labelled data alone.
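This is not the thesis code, but one standard way to let Naive Bayes exploit unlabelled documents (in the spirit of EM-based approaches such as Nigam et al.) can be sketched as follows: train on the labelled set, softly label the unlabelled set, then retrain on the combined, probabilistically weighted counts.

```python
import numpy as np

# Hedged sketch of semi-supervised multinomial Naive Bayes via an
# EM-style loop: the M-step fits class priors and word probabilities
# from (soft) label weights, the E-step re-estimates soft labels for
# the unlabelled documents.

def nb_fit(X, P):
    """X: doc-term count matrix; P: (n_docs, n_classes) soft labels."""
    prior = P.sum(axis=0) / P.sum()
    counts = P.T @ X + 1.0                    # Laplace smoothing
    cond = counts / counts.sum(axis=1, keepdims=True)
    return np.log(prior), np.log(cond)

def nb_soft_predict(X, log_prior, log_cond):
    """Posterior class probabilities P(c | doc) for each row of X."""
    logp = X @ log_cond.T + log_prior         # log P(c) + sum n_w log P(w|c)
    logp -= logp.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

# Tiny toy corpus: 4-word vocabulary, 2 classes.
X_lab = np.array([[3, 0, 1, 0], [0, 3, 0, 1]], dtype=float)
y_lab = np.array([0, 1])
P_lab = np.eye(2)[y_lab]                      # hard labels as one-hot
X_unl = np.array([[2, 0, 2, 0], [0, 2, 0, 2]], dtype=float)

lp, lc = nb_fit(X_lab, P_lab)                 # start from labelled data only
for _ in range(5):                            # E-step, then M-step
    P_unl = nb_soft_predict(X_unl, lp, lc)
    lp, lc = nb_fit(np.vstack([X_lab, X_unl]),
                    np.vstack([P_lab, P_unl]))
```

The unlabelled documents contribute fractional counts in proportion to the model's current confidence, which is what lets them sharpen the word-probability estimates beyond what the labelled set alone supports.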
Last Update: March 11, 2007.
Sauro Menchetti Home Page.