Docente
Lecturer
- Paolo Frasconi
- Dipartimento di Sistemi e Informatica
- Email:
- Ricevimento: Mercoledi 15:00-17:00Office hours: Wednesday 3:00pm-5:00pm
Orario delle lezioni
Class schedule
| Mercoledi | 17:00-19:00 | Aula 207 (S. Marta) |
| Giovedi | 16:00-18:00 | Aula 207 (S. Marta) |
| Venerdi | 14:00-17:00 | Aula 204 (S. Marta) |
| Wednesday | 5:00pm-7:00pm | Room 207 (S. Marta) |
| Thursday | 4:00pm-6:00pm | Room 207 (S. Marta) |
| Friday | 2:00pm-5:00pm | Room 204 (S. Marta) |
Obiettivi
Goals
- Al termine del corso conoscerete numerosi algoritmi fondamentali ed alcuni algoritmi avanzati per l'apprendimento statistico, avrete basi di teoria computazionale dell'apprendimento e sarete in grado di progettare soluzioni allo stato dell'arte per risolvere problemi applicativi.
- You will learn about several fundamental and some advanced algorithms for statistical learning, you will know the basics of computational learning theory, and will be able to design state-of-the-art solutions to application problems.
Crediti
Credits
- L'insegnamento vale 9 crediti formativi, formalmente corrispondenti a circa 72 ore di didattica assistita (lezioni ed esercitazioni in classe), e 153 ore di studio individuale ed attività sperimentale.
- Per altri corsi di laurea (p.es. Ingegneria dell'automazione e Informatica per la facoltà di Scienze) il corso vale i crediti formativi stabiliti dal vostro ordinamento (tipicamente 5). In questi casi potete abbandonare le lezioni una volta conclusa la parte sui modelli grafici probabilistici.
- The course is worth 9 credits (about 72 hours in class, and 153 hours of individual study and experimental work).
- For other degree programs (e.g. M.Sc. in Automation, or M.Sc. in Computer Science) the course is worth the credits that have been assigned to it in your curriculum (usually 5). In these cases, you may drop classes after we complete the part on probabilistic graphical models.
Prerequisiti
Prerequisites
- I prerequisiti necessari includono una buona conoscenza di un linguaggio di programmazione e solide basi di matematica (analisi, algebra lineare, teoria della probabilità). Conoscenze preliminari di tecniche di ottimizzazione e di statistica sono utili ma non strettamente necessarie.
- A good knowledge of a programming language, and a solid background in mathematics (calculus, linear algebra, and probability theory) are necessary prerequisites to this course. Previous knowledge of optimization techniques and statistics would be useful but not strictly necessary.
Esame
Exams
- L'esame consiste di un'unica prova orale. L'argomento è a vostra scelta ma vi consiglio di concordarlo con me prima di iniziare la preparazione.
- Di solito viene assegnata la lettura di alcuni articoli scientifici e potrebbe essere richiesto riprodurre alcuni risultati sperimentali.
- Dovrete esporre brevemente (30 minuti) il vostro argomento. Assicuratevi che la presentazione includa un'introduzione al problema studiato, una breve rassegna della letteratura rilevante, descrizione tecnica del metodo e, se appropriato, una descrizione dettagliata del lavoro sperimentale. Potete usare strumenti multimediali oer la vostra presentazione. Siete responsabili della comprensione dei concetti rilevanti e della teoria.
- Potete formare gruppi di due studenti (tre in casi eccezionali e motivati) per lavori di tipo sperimentale. In questi casi assicuratevi che il contributo apportato da ciascuno di voi sia chiaramente riconoscibile.
- Non ci sono date predefinite per l'esame. Vi prego di concordare con me la data in cui volete sostenerlo.
- There is a single oral final exam. You can choose the exam topic but you are strongly advised to discuss with me before you begin working on it.
- Typically, you will be assigned a set of papers to read and you may be asked to reproduce some experimental results.
- You will be required to give a short (30 min) presentation during the exam. Please ensure that your presentation includes an introduction to the problem being addressed, a brief review of relevant literature, technical derivation of methods, and, if appropriate, a detailed description of experimental work. You are allowed to use multimedia tools to prepare your presentation. You are responsible for understanding all the relevant concepts and the underlying theory.
- You can work in groups of two to carry out experimental works (three is an exceptional number that you must motivate clearly). If you do so, please ensure that personal contributions to the overall work are clearly identifiable.
- There are no scheduled dates for the exam. Please contact me to arrange your exam date.
Libri suggeriti
Suggested books
- [HTF09] T. Hastie, R. Tibshirani, and J. Friedman (2009) The Elements of Statistical Learning Springer.
- [KF09] D. Koller and N. Friedman (2009) Probabilistic Graphical Models: Principles and Techniques MIT Press
- [D08] L. De Raedt (2008) Logical and Relational Learning. Springer.
- + Altri libri potenzialmente utili:
- + Additional books that may be useful:
Programma sintetico
Topics
- Apprendimento con supervisione. Regressione. Classificazione. Teoria computazionale dell'apprendimento. Modelli lineari generalizzati. Apprendimento non supervisionato. Macchine a vettori di supporto. Nuclei. Boosting e bagging. Modelli grafici probabilistici (semantica, inferenza, apprendimento di parametri e struttura). Apprendimento con uscite strutturate. Programmazione logica induttiva e apprendimento statistico-relazionale. Applicazioni.
- Supervisised learning. Regression. Classification. Computational learning theory. Generalized linear models. Unsupervised learning. Support vector machines. Kernels. Boosting and bagging. Probabilistici graphical models (semantics, inference, parameters and structure learning). Structured-output learning. Inductive logic programming and statistical relational learning. Applications.
Calendario e materiale didattico
Calendar and handouts
| Data | Argomenti | Letture e materiale* |
| 30/09/2009 | Aspetti organizzativi del corso. Introduzione: l'apprendimento automatico per risolvere problemi difficili in IA. | HTB09 1. |
|---|---|---|
| 1/10/2009 | Apprendimento con supervisione. Tipologie di ingressi e uscite. Esempi di applicazioni. Compiti: regressione, classificazione, classificazione multi-classe, ranking, uscite strutturate. Schema astratto di algoritmo di apprendimento supervisionato. Assunzioni sui dati. | HTB09 2. Appunti della lezione |
| 2/10/2009 | Regressione lineare ordinaria. Stima ai minimi quadrati. Interpretazione geometrica. Analisi statistica. | HTB09 3.2. |
| 7/10/2009 | Teorema di Gauss-Markov. Decomposizione bias-varianza. Selezione di un sottoinsieme di predittori: best-subset, forward selection, backward selection. Discussione. | HTB09 3.2.2, 3.3, Dimostrazione del teorema di Gauss-Markov |
| 8/10/2009 | Esercitazione: OLR in Matlab, espansione in funzioni base polinomiali e RBF, regolarizzazione con Ridge regression. | HTB09 3.4.1, Codice Matlab |
| 9/10/2009 | Sparsità e selezione di sottoinsiemi di predittori. Cenni su compressive sensing. Lasso. Forward stagewise regression. LARS. | HTB09 3.4. CVX software for Matlab Mini-survey on compressive sensing: Baraniuk (2007) LARS paper: Efron et al. (2004) LARS software: R Matlab |
| 14/10/2009 | Introduzione alla classificazione: approccio generativo e discriminante. Il principio di massima verosimiglianza. Esempi. Proprietà degli stimatori a massima verosimiglianza: consistenza, equivarianza, gaussianità asintotica, ottimalità | Wasserman 2004, cap. 9 Massima verosimiglianza |
| 15/10/2009 | Modelli generativi per la classificazione. Analisi discriminante lineare (Gaussiana). | HTB09 4.3. |
| 16/10/2009 | Errore di Bayes. Esempi. Regressione logistica. Ottimizzazione con Newton-Raphson. Modelli lineari generalizzati (esempio: regressione di Poisson). | HTB09 4.4. M. Jordan. Why the logistic function? - UAI 1995 |
| 21/10/2009 | Regressione softmax (regressione logistica multiclasse). Introduzione a Naive Bayes. | HTB09 4.4, 6.6.3. T. Mitchell: "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" |
| 23/10/2009 | Naive Bayes 2. Applicazione a data set generali con attributi categorici e numerici. Modelli di eventi alternativi per la categorizzazione del testo. Decomposizione dell'errore di generalizazione. Learning curves. Naive Bayes vs. Logistic Regression. |
A. McCallum and K. Nigam.
A Comparison of Event Models for Naive Bayes Text Classification - AAAI 1998
U. Fayyad and K. Irani: Multi-interval discretization of continuous-valued attributes for classification learning - IJCAI 1993 A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes - NIPS 2002 |
| 28/10/2009 | Iperpiano a massimo margine. Teoria di Karush Kuhn Tucker per risolvere problemi di ottimizzazione convessi in forma duale. Struttura della soluzione. | HTB09 4.5. Schoelkopf and Smola (2002) Learning with Kernels Ch. 6, 7.3 |
| 29/10/2009 | Macchine a vettori di supporto (C-SVC, nu-SVC). Analisi dei vettori di supporto, scelta di C. | HTB09 12.1, 12.2. |
| 30/10/2009 | Introduzione informale ai kernels. Panoramica sui risolutori per SVM. SVM come minimizzazione del rischio empirico regolarizzato. Hinge vs. logit loss. | HTB09 12.3.1, 12.3.2. L. Bottou and C.J. Lin (2006). Support Vector Machine Solvers. L. Bottou Trade-offs of Large-Scale Learning - NIPS 2008 LibSVM software |
| 4/11/2009 | Regressione a vettori di supporto. Kernels validi, feature map, teorema di Mercer. Feature map di Mercer. Introduzione alla feature map RKHS. | HTB09 12.3.6, 5.8. Schoelkopf and Smola (2002) Learning with Kernels Ch. 2 Cucker and Smale (2001): On the Mathematical Foundations of Learning |
| 6/11/2009 | Spazi di Hilbert con kernel rirpoducente. Kernel trick, trasformazione di algoritmi esistenti via kernel: (voted) perceptron, ridge regression, (dimostrazione del representer theorem per la loss quadratica). Esempi di progetto di kernels: istogrammi per immagini, Kernels su insiemi. Multi-instance learning. | HTB09 5.8, 12.3.7 Y. Freund and R.E. Schapire: Large Margin Classification Using the Perceptron Algorithm K. Grauman and T. Derrel: The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features - ICCV 2005 |
| 11/11/2009 | Kernels convoluzionali. Kernels per dati strutturati: k-grams per la classificazione di proteine, kernels per alberi sintattici, kernels di Tanimoto kernels per la classificazione di piccole molecole. |
D. Haussler (1999):
Convolution Kernels on Discrete Structures - MLJ 1999
C. Leslie et al.: The spectrum kernel: A string kernel for SVM protein classification - PSB 2002 P. Kuksa et al.: Scalable Algorithms for String Kernels with Inexact Matching - NIPS 2008 M. Collins and N. Duffy Convolution Kernels for Natural Language - NIPS 2001 L. Ralaivola et al.: Graph kernels for chemical informatics - NN 2005 |
| 12/11/2009 | Jesse Davis (U. Washington). Inductive Learning: Beyond the Standard Setting | J. Davis and P. Domingos. Deep Transfer via Second-Order Markov Logic ICML 2009 |
| 13/11/2009 | Introduzione ai modelli grafici probabilistici. Probabilità e probabilità soggettive (credenze). Independenza condizionale e modelli di dipendenza. Markov networks: semantica e decomposizione della distribuzione congiunta (Teorema di Hammersley-Clifford). |
Dettagli in KF09 Cap 1-4 D. Koller et al. (2007) Graphical models in a Nutshell in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 18/11/2009 | Bayesian networks. Introduzione all'inferenza (eliminazione di variabili, belief propagation). | A. Darwiche Bayesian Networks Communications of the ACM, to appear. |
| 19/11/2009 | Idee principali nell'algoritmo del junction tree. Apprendimento di parametri in Bayesian networks: Stima a massima verosimiglianza con dati completi. |
D. Koller et al. (2007)Graphical models in a Nutshell in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 20/11/2009 | Apprendimento con dati mancanti. Expectation-Maximization. Modelli basati su misture (misture di Gaussiane). K-means. | HTB09 8.5, 13.2.1, 13.2.3, 14.3.6, 14.3.7. T. Minka Expectation-Maximization as lower bound maximization Tech. Report. 1999. |
| 25/11/2009 | Plates, dipendenza tra parametri in Bayesian networks, idee fondamentali sulle reti probabilistiche relazionali. Apprendimento con ascesa del gradiente e EM per Bayesian networks con variabili non osservate. |
K. Nigam et al. (2006)Semi-Supervised Text Classification Using EM
in O Chapelle et al.: Semi-supervised learning L. Getoor et al. (2007)Probabilistic Relational Models in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 26/11/2009 | Apprendimento bayesiano di Bayesian networks. Stima massima a posteriori e regolarizzazione. Distribuzioni Beta e Dirichlet come priors. | D. HeckermanA tutorial on learning with Bayesian networks.Technical Report MSR-TR-95-06, Microsoft Research, 1996. |
| 02/12/2009 | Apprendimento relazionale. Dati sequenziali. Hidden Markov Models. Procedura forward backward di Baum-Welch. Algoritmo di Viterbi per inferenza MAP. | L.R. Rabiner A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2):257-286, 1989. |
| 03/12/2009 | EM per l'apprendimento negli HMMs. HMMs come modelli log-lineari. Etichettamento discriminativo di sequenze: Conditional random fields. | C. Sutton and A. McCallum (2007) An Introduction to Conditional Random Fields for Relational Learning. in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 04/12/2009 | Seminario (M. Lippi). Markov logic networks e applicazioni. |
P. Domingos et al. (2008)
Markov Logic in L. De Raedt et al. Probabilistic Inductive Logic Programming. M. Lippi and P. Frasconi (2009) Prediction of protein beta-residue contacts by Markov logic networks with grounding-specific weights Bioinformatics 25(18):2326-2333. Alchemy (Markov Logic software) |
| 09/12/2009 | Seminario (M. Lippi). Programmazione logica induttiva (FOIL, Tilde). |
R. Quinlan and R.M. Cameron-Jones (1993)
FOIL: A midterm report, ECML '93. H. Blockeel and L. De Raedt (1998) Top-down induction of first-order logical decision trees. Artificial Intelligence 101(1-2):285--297. |
| 10/12/2009 | Topic modeling. Latent Dirichlet Allocation. |
D. Blei et al. (2003)
Latent Dirichlet Allocation Journal of Machine Learning Research 3, 993--1022 T.L Griffiths and M. Steyvers (2004) Finding scientific topics Proceedings of the National Academy of Sciences 101(Supp. 1):5228 Mallet (LDA and CFR Software) |
| 11/12/2009 | Ensemble methods, bagging, boosting, random forests. |
HTB09 8.7, 10, 15. Y. Freund and R.E. Schapire (1999) A Short Introduction to Boosting Journal of Japanese Society for Artificial Intelligence 14(5):771-780. L. Breiman (2001) Random ForestsMachine Learning 45. Random forest software |
| Date | Topics | Readings and handouts* |
| 30/09/2009 | Administrivia. Introduction. Solving difficult AI problems: the machine learning approach. | HTB09 1. |
|---|---|---|
| 1/10/2009 | Supervised learning. Types of inputs and outputs. Application examples. Tasks: regression, classification, multi-class classification, ranking, structured-output learning. Abstract view of a supervised learning algorithm. Assumptions on data. | HTB09 2. Class notes |
| 2/10/2009 | Ordinary Linear Regression. Least squares estimates. Geometric interpretation. Statistical analysis. | HTB09 3.2. |
| 7/10/2009 | Gauss-Markov Theorem. Bias-variance decomposition. Subset selection: best subset, forward and backward stepwise. Discussion. | HTB09 3.2.2, 3.3, Proof of Gauss-Markov Theorem |
| 8/10/2009 | Practice: OLR in Matlab (toy examples), basis functions (polynomial, RBF), regularization with Ridge regression. | HTB09 3.4.1, Matlab code |
| 9/10/2009 | Sparsity and subset selection. Basic ideas of compressive sensing. Lasso. Forward stagewise regression and LARS. | HTB09 3.4. CVX software for Matlab Mini-survey on compressive sensing: Baraniuk (2007) LARS paper: Efron et al. (2004) LARS software: R Matlab |
| 14/10/2009 | Introduction to classification: generative and discriminative approaches. The maximum likelihood principle. Examples. Properties of maximum likelihood estimators: consistency, equivariance, asyntotic normality, optimality. | Wasserman 2004, ch. 9 Maximum likelihood |
| 15/10/2009 | Generative models for classification. Linear (Gaussian) discriminant analysis. | HTB09 4.3. |
| 16/10/2009 | Bayes error rate. Examples. Logistic regression. Newton-Raphson optimization. Generalized linear models (example: Poisson regression). | HTB09 4.4. M. Jordan. Why the logistic function? - UAI 1995 |
| 21/10/2009 | Softmax regression (multiclass logistic regression). Introduction to Naive Bayes. | HTB09 4.4, 6.6.3. T. Mitchell: "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression" |
| 23/10/2009 | Naive Bayes 2. Application to general data sets with categorical and numerical attributes. Alternative event models for text categorization. Decomposition of the generalization error. Learning curves. Naive Bayes vs. Logistic Regression. |
A. McCallum and K. Nigam.
A Comparison of Event Models for Naive Bayes Text Classification - AAAI 1998
U. Fayyad and K. Irani: Multi-interval discretization of continuous-valued attributes for classification learning - IJCAI 1993 A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes - NIPS 2002 |
| 28/10/2009 | Maximal margin hyperplane. Karush Kuhn Tucker theory for solving convex optimization problems in dual form. Structure of the MMH solution. | HTB09 4.5. Schoelkopf and Smola (2002) Learning with Kernels Ch. 6, 7.3 |
| 29/10/2009 | Support vector machines (C-SVC, nu-SVC). Analysis of support vectors. Choosing C. | HTB09 12.1, 12.2. |
| 30/10/2009 | Informal introduction to kernels. Overview of SVM solvers. SVM as regularized empirical loss minimization. Hinge vs. logit loss. | HTB09 12.3.1, 12.3.2. L. Bottou and C.J. Lin (2006). Support Vector Machine Solvers. L. Bottou Trade-offs of Large-Scale Learning - NIPS 2008 LibSVM software |
| 4/11/2009 | Support vector regression. Valid kernels, feature map, and Mercer theorem. Mercer feature map. Introduction to the RKHS feature map. | HTB09 12.3.6, 5.8. Schoelkopf and Smola (2002) Learning with Kernels Ch. 2 Cucker and Smale (2001): On the Mathematical Foundations of Learning |
| 6/11/2009 | Reproducing kernel Hilbert spaces. Kernel trick and kernelizing existing algorithms: (voted) perceptron, ridge regression, (proof of representer theorem for square loss). Examples of kernel design: image histograms. Kernels on sets. Multi-instance learning. | HTB09 5.8, 12.3.7 Y. Freund and R.E. Schapire: Large Margin Classification Using the Perceptron Algorithm K. Grauman and T. Derrel: The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features - ICCV 2005 |
| 11/11/2009 | Convolution kernels. Kernels for structured instances: k-grams for protein classification, subtree kernels for parse trees, Tanimoto kernels for small molecule classification. |
D. Haussler (1999):
Convolution Kernels on Discrete Structures - MLJ 1999
C. Leslie et al.: The spectrum kernel: A string kernel for SVM protein classification - PSB 2002 P. Kuksa et al.: Scalable Algorithms for String Kernels with Inexact Matching - NIPS 2008 M. Collins and N. Duffy Convolution Kernels for Natural Language - NIPS 2001 L. Ralaivola et al.: Graph kernels for chemical informatics - NN 2005 |
| 12/11/2009 | Jesse Davis (U. Washington). Inductive Learning: Beyond the Standard Setting | J. Davis and P. Domingos. Deep Transfer via Second-Order Markov Logic ICML 2009 |
| 13/11/2009 | Introduction to probabilistic graphical models. Review of probabilities and subjective probability (beliefs). Conditional independence and dependency models. Markov networks: semantics and decomposition of the joint distribution (Hammersley-Clifford theorem). |
Details in KF09 Ch 1-4 D. Koller et al. (2007)Graphical models in a Nutshell in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 18/11/2009 | Bayesian networks. Introduction to inference (variable elimination, belief propagation). | A. Darwiche Bayesian Networks Communications of the ACM, to appear. |
| 19/11/2009 | Sketch of the junction tree algorithm. Parameter learning in Bayesian networks: Maximum likelihood estimation with complete data. |
D. Koller et al. (2007)Graphical models in a Nutshell in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 20/11/2009 | Learning with missing data. Expectation-Maximization. Mixture models (mixtures of Gaussians). K-means. | HTB09 8.5, 13.2.1, 13.2.3, 14.3.6, 14.3.7. T. Minka Expectation-Maximization as lower bound maximization Tech. Report. 1999. |
| 25/11/2009 | Plates, parameter dependences in Bayesian networks, core ideas of probabilistic relational networks. Gradient ascent and EM-based learning in Bayesian networks with hidden variables. |
K. Nigam et al. (2006)Semi-Supervised Text Classification Using EM
in O Chapelle et al.: Semi-supervised learning L. Getoor et al. (2007)Probabilistic Relational Models in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 26/11/2009 | Bayesian learning of Bayesian networks. Maximum a posteriori and regularization. Beta and Dirichlet distributions as priors. | D. HeckermanA tutorial on learning with Bayesian networks.Technical Report MSR-TR-95-06, Microsoft Research, 1996. |
| 02/12/2009 | Relational learning. Sequential data. Hidden Markov Models. Baum-Welch's forward backward procedure, Viterbi's algorithm for MAP inference. | L.R. Rabiner A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2):257-286, 1989. |
| 03/12/2009 | EM learning for HMMs. HMMs as log-linear models. Discriminative sequence labeling: Conditional random fields. | C. Sutton and A. McCallum (2007) An Introduction to Conditional Random Fields for Relational Learning. in L. Getoor and B. Taskar: Introduction to Statistical Relational Learning |
| 04/12/2009 | Guest lecture (M. Lippi). Markov logic networks and applications. |
P. Domingos et al. (2008)
Markov Logic in L. De Raedt et al. Probabilistic Inductive Logic Programming. M. Lippi and P. Frasconi (2009) Prediction of protein beta-residue contacts by Markov logic networks with grounding-specific weights Bioinformatics 25(18):2326-2333. Alchemy (Markov Logic software) |
| 09/12/2009 | Guest lecture (M. Lippi). Inductive logic programming (FOIL, Tilde). |
R. Quinlan and R.M. Cameron-Jones (1993)
FOIL: A midterm report, ECML '93. H. Blockeel and L. De Raedt (1998) Top-down induction of first-order logical decision trees. Artificial Intelligence 101(1-2):285--297. |
| 10/12/2009 | Topic modeling. Latent Dirichlet Allocation. |
D. Blei et al. (2003)
Latent Dirichlet Allocation Journal of Machine Learning Research 3, 993--1022 T.L Griffiths and M. Steyvers (2004) Finding scientific topics Proceedings of the National Academy of Sciences 101(Supp. 1):5228 Mallet (LDA and CFR Software) |
| 11/12/2009 | Ensemble methods, bagging, boosting, random forests. |
HTB09 8.7, 10, 15. Y. Freund and R.E. Schapire (1999) A Short Introduction to Boosting Journal of Japanese Society for Artificial Intelligence 14(5):771-780. L. Breiman (2001) Random ForestsMachine Learning 45. Random forest software |

