Sauro Menchetti Sauro Menchetti

Department of Systems and Computer Science
University of Florence
Via di Santa Marta, 3
50139 Florence - Italy
Tel: +39 055 4796361
Fax: +39 055 4796363


Sauro Menchetti, PhD

Ph.D. in Computer Science and Automation Engineering at DSI,
Department of Systems and Computer Science, University of Florence
Supervisors: Prof. Paolo Frasconi, Prof. Giovanni Soda
Member of Machine Learning and Neural Networks Group

Contact Informations

DSI, Department of Systems and Computer Science
University of Florence
Via di Santa Marta, 3
50139 Florence - Italy
Lab. Phone: +39 055 4796 361
Fax: +39 055 4796 363


Biography

I was born on December 9th, 1975 in Arezzo, Italy. I received my M.Sc. Degree in Computer Science Engineering on April 20, 2001 from the University of Florence with a final mark of 110 out of 110 with honours, discussing a thesis entitled "Extension of Naive Bayes Classifier for Text Categorization using Partial Labelled Data". For seven months I collaborated in a research project at the DSI (University of Florence) and in January 2002 I started my Ph.D. studies in Computer Science and Automation Engineering at DSI. From October 2003 to April 2004 I was a visitor researcher at Computer Science Department of University College London for working together on a project about Kernels on Structured Data for Preference and Ranking Problems with Prof. Massimiliano Pontil. I received my Ph.D. in Computer Science and Automation Engineering on March 3, 2006 from the University of Florence, discussing a thesis entitled "Learning Preference and Structured Data: Theory and Applications".


Ph.D. Thesis

2005, December 31: "Learning Preference and Structured Data: Theory and Applications" [Abstract] [Table of Contents] [BibTeX] [Slides]


Publications

Journals

Proceedings

Technical Reports

Workshops

Talks

Book Parts


Software


M.Sc. Thesis

2001, April 20: "Extension of Naive Bayes Classifier for Text Categorization using Partial Labelled Data" [BibTeX]


Links

Related Sites on Neural Networks and Machine Learning

Web Change 2001

Curriculum Vitae

Fishing


Research Interests

Kernel Methods: SVMs and Voted Perceptron

The success of machine learning methods depends on their ability to solve pattern recognition and regression problems. The kernel methodology provides a powerful and unified framework for many disciplines, from neural networks to statical pattern recognition, to machine learning and data mining. Any kernel methods solution comprises two parts:

  • a module that performs the mapping into the feature space;
  • a learning algorithm designed to discover linear pattern in that space.

There are two main reasons why this approach should work. First of all, detecting linear relations has been the focus of much research in statistics and machine learning for decades and the resulting algorithms are both well understood and efficient. Secondly, we will see that a kernel function represents a computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces, to ensure adequate representational power.

Support Vector Machines (SVMs), originally developed by Vapnik and co-workers, are the most widespread kernel-based machine learning algorithm nowadays. Immediately after their introduction, an increasing number of researcher have worked on both the algorithmic and theoretical analysis of these systems, creating in just a few years what is effectively a new research direction, merging concepts from distant disciplines as statistics, functional analysis, optimization and machine learning.

SVMs represent a very specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution and the capacity control obtained by acting on the margin or on other dimension independent quantities as the number of support vectors. They provide a new approach to the problem of pattern recognition based on the statistical learning theory. Moreover producing sparse dual representation of the hypothesis results in an efficient algorithm which at the same time enforces the learning biases suggested by the generalization theory. SVMs differ radically from comparable approaches such as neural networks: they find always a global optimum because of they are formulated as a Quadratic Programming (QP) optimization problem with box constraints. Their simple geometric interpretation provides fertile ground for explaining how they work in a very easy manner.

SVMs are largely characterized by the choice of kernel which maps the inputs into a feature space in which they are separated by a linear hypothesis. Since the data only appear through inner products, a Mercer kernel can be plugged into the algorithm permitting to separate the data by a non linear decision rule. Often the feature space is a very high dimensional one but the so-called curse of dimensionality problem is cleverly solved turning to the statistical learning theory. It says essentially that the difficulty of an estimation problem increases drastically with the dimension n of the space since one needs exponentially many patterns in the dimension n to sample the space properly. But the statistical learning theory tells us that learning into a very high dimensional feature space can be simpler if one uses a low complexity, i.e. simple class of decision rule. All the variability and richness that one needs to have a powerful function class is then introduced by the mapping phi through the kernel function. In short: not the dimensionality but the complexity of the function matters. In addition, for certain feature spaces and corresponding mapping phi, there is a highly effective trick for computing scalar products in a high dimensional feature space using kernel functions.

So SVMs represent a complete framework where several concepts are combined together to form a powerful theory: dimension independent generalization bounds, Mercer kernels and RKHS, regularization theory and QP optimization represent therefore the foundations of SVMs.

Voted Perceptron (VP) is an online mistake-driven algorithm for binary classification based on Rosenblatt's Perceptron algorithm. It is simpler to implement and more efficient in terms of computational time with respect to SVMs, although there are no convergence guarantees. On the other hand, as experimentally shown, the performance obtained by VP tends to the performance of maximal-margin classifiers. The basic vector-based formulation can be easy kernelized, so it can be efficiently employed in very high dimensional space and can be extended to generic data structures in a very simple way. Unlike perceptron algorithm, VP stores and uses all prediction vectors which are generated after every mistake to classify a new example. Each such vector is weighted by the number of training examples it will correctly classify until the next mistake will occur: so good prediction vectors that correctly classify a lot of examples, will have a larger coefficient in the weighted sum.

Kernels on Structured Data

Since many real world data are structured and consequently have no natural representation in a vector space, statistical learning in structured and relational domains is rapidly becoming one of the central areas of machine learning.

By structured data we mean data that is formed by combining simpler components into more complex items, frequently involving a recursive use of simpler objects of the same type. Typically it will be easy to compare the simpler components either with base kernels or using an inductive argument over the structure of the objects. Examples of structured data include vectors, strings and sequences but also more complex objects such as trees, images and graphs. The design of kernels for structured data enables the same algorithms and analysis of structural pattern recognition to be applied to new application domains. The extension of kernels to structured data provides the way for analysing very different sets of data-types, opening up possibilities such as discovering clusters in a set of trees, learning classifications of graphs and so on. It is therefore by designing specific kernels for structured data that kernel methods can demonstrate their full flexibility.

Probably the most important data type after vectors is that of symbol strings of varying lengths. This type of data is commonplace in bioinformatics applications, where it can be used to represent proteins as sequence of amino acids, genomic DNA as sequence of nucleotides, promoters and other structures. Partly for this reason, a great deal of research has been devoted to it in the last few years. Many other application domains consider data in the form of sequences so that many of the techniques have a history of development within computer science, as for example the study of string algorithms. Kernel have been developed to compute the inner product between images of strings in high-dimensional feature spaces using dynamic programming techniques. The set of kernels on strings and more generally on structured objects, makes it possible for kernels methods to operate in a domain that traditionally has belonged to syntactical pattern recognition, in this way providing a bridge between that field and statistical pattern analysis.

Ordinal Regression, Preference Model and Ranking

Work on learning theory has mostly concentrated on classification and regression. However, there are many applications in which it is desirable to order rather than to classify instances: these problems arise frequently in social sciences, in information retrieval, in econometric models and in classical statistics where human preferences play a major role.

In a general supervised learning task, the goal is to learn a function f which best models the probabilistic relation between the input space X and the target space Y minimizing a risk functional depending on a predefined loss function. The properties of the target set Y define different learning problems:

  • if Y is a finite unordered set, we have a classification problem;
  • if Y is a metric space, e.g., the set of real numbers, we have a regression problem.

Ordinal regression, partial ranking and preference model tasks do not fit in any two previous classes but share properties of both classification and regression problems:

  • Y is a finite ordered set;
  • Y is not a metric space, that is the distance between two elements is not defined.

So, as in classification problems, we have to assign a possible label to a new instance, but similar to regression problems, the label set admits an order relation. More precisely, in a ranking problem we have to sort a set of competing alternatives by their importance, while in a preference problem we are only interested in the best element: note that the preference problem is a particular case of ranking.

Text Categorization

During my M.Sc. thesis, I designed an extension of Naive Bayes classifier for text categorization. I worked in the framework of semi-supervised learning where a set of both labelled and unlabelled data is given and the task is to construct a better classifier exploiting unlabelled data with respect to use only labelled data.


Last Update: March 11, 2007. Sauro Menchetti Home Page.