Machine Learning and Neural Networks group

Department of Systems and Computer Science
University of Florence
Via Santa Marta 3
50139 Firenze - Italy

Protein Structure Prediction

People


Paolo Frasconi
Alessandro Vullo
Andrea Passerini
Alessio Ceroni

Table of contents

Introduction

Secondary Structure

Cysteines Bonding State and Connectivity

Fine Grained Contact Maps

Coarse Grained Contact Maps

References


Introduction


Proteins are polypeptide chains carrying out most of the basic functions of life at the molecular level. These linear chains fold into complex 3D structures whose shape is responsible for the proteins' behavior. Each link of the chain consists of one of the 20 amino acids existing in nature; the sequence of amino acids is the protein's primary structure. Proteins are synthesized inside cells. The instructions used by the cell to build a protein are written in the DNA. DNA's double helix is made of two long chains composed of four different nucleotides. The DNA contains genes, sequences of nucleotides coding for proteins. Each triplet of nucleotides in a gene corresponds to an amino acid of the encoded protein.
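The correspondence between nucleotide triplets and amino acids can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a coding DNA strand read from its first base, and one-letter amino acid codes; the function name translate is hypothetical and chosen only for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding DNA sequence into a protein primary sequence.

    Reads non-overlapping triplets from position 0 and stops at the
    first stop codon or at the end of the sequence.
    """
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGTTAA"))  # methionine, alanine, cysteine, then stop
```

Real gene finding is considerably more involved (introns, reading frames, both strands); the sketch only shows the triplet-to-amino-acid mapping itself.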


Example of protein 3D structure (2bnh00 horse-shoe domain from CATH).


Thanks to several genome sequencing projects, the entire DNA sequence of many organisms has been experimentally determined. Inside each genome, the positions of genes have been located using specific signals, particular sequences of nucleotides used by cells during transcription. From these identified genes the proteins' primary sequences have been extracted. Unfortunately, our knowledge often stops here. The proteins' 3D (tertiary) structure, essential to study their functions, remains largely unknown. There exist physical methods to estimate the coordinates of each atom of a protein, but they need the protein to be crystallized, a time-consuming process which, at the moment, is impossible to automate and carry out on a large scale. Even though the proteins whose primary sequence is known number in the millions, only a few thousand of them have been crystallized and had their 3D structure deposited in the Protein Data Bank. Unfortunately, alternative approaches based on nuclear magnetic resonance cannot be applied at the genomic scale either. It is therefore becoming increasingly important to predict a protein's tertiary structure ab initio from its amino acid sequence, using insights obtained from already known structures.



Secondary Structure


Proteins present local regularities in their 3D structure, formed and maintained by hydrogen bonds between atoms. These regular structures are referred to as the protein's secondary structure. The most common configurations observed in proteins are called alpha helices and beta strands, while all the other conformations are referred to as coils. A group of adjacent amino acids sharing the same conformation forms a segment of secondary structure. Segments of secondary structure are well-defined and stable aggregations of amino acids which strongly influence the chain's folding and which usually carry out specific functions inside the protein, much like a sequence of words in a language forms a meaningful phrase.
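As a minimal illustration of how segments arise from per-residue conformations, the following Python sketch groups runs of adjacent residues with the same state. The three-letter alphabet (H for helix, E for strand, C for coil) and the function name segments are assumptions made for this example.

```python
from itertools import groupby


def segments(secondary_structure):
    """Split a per-residue secondary structure string into segments.

    Returns a list of (state, first_index, last_index) tuples, one per
    run of adjacent residues sharing the same conformation.
    Indices are 0-based and inclusive.
    """
    result = []
    start = 0
    for state, run in groupby(secondary_structure):
        length = len(list(run))
        result.append((state, start, start + length - 1))
        start += length
    return result


for state, first, last in segments("CCHHHHCEEEC"):
    print(state, first, last)
```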


Schematic representations of secondary structure segments.


Reliable predictors of the secondary structure of a protein are fundamental to the study of its folding and functions. Threading algorithms, which attempt to determine the fold of unknown proteins, use predicted secondary structure sequences to search databases of known folds. Moreover, the predicted secondary structure content of a protein can be used to identify its folding family (CATH, SCOP) and thus estimate its functions.

Here at the Machine Learning and Neural Networks Group, we have developed our own predictor of secondary structure. We created a two-stage architecture which uses multiple alignments [1] and neural networks, and which is capable of state-of-the-art performance as measured on the PDB select dataset. For additional information on the subject, read our technical report [2]. The prediction server can be accessed from this page.



Cysteines Bonding State and Connectivity


Cysteine is one of the twenty amino acids that constitute proteins. The oxidized form of cysteine plays a fundamental role in stabilizing the native conformation of proteins. The covalent bonds formed by cysteines, known as disulfide bridges, may connect very distant portions of the sequence. The location of these bonds is a very informative constraint on the conformational space, and the associated information represents a significant step towards folding or understanding structural properties of the protein. Prediction of disulfide bridges from sequence is thus one of the important (and difficult) tasks in structural genomics. Recent work in this area suggests methodologies based on two steps. First, the disulfide-bonding state of each cysteine is predicted (a binary classification problem). Subsequently, once candidate cysteines are known, other algorithms can be used to predict the actual location of disulfide bridges.
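To give a feeling for the size of the search space in the second step, the sketch below enumerates every possible connectivity pattern for a set of cysteines already predicted as bonded. It assumes an even number of bonded cysteines, each forming exactly one bridge within the same chain; the function name connectivity_patterns is hypothetical, and this is not the predictor described in this section, only an enumeration of its candidate outputs.

```python
def connectivity_patterns(cysteines):
    """Yield every way of pairing the given cysteine positions.

    Each pattern is a list of (position_a, position_b) disulfide
    bridges in which every cysteine appears exactly once.
    """
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pattern in connectivity_patterns(remaining):
            yield [(first, partner)] + pattern


# Hypothetical sequence positions of four bonded cysteines.
for pattern in connectivity_patterns([5, 12, 30, 41]):
    print(pattern)
```

With n bridges there are (2n-1)!! = 1 x 3 x 5 x ... x (2n-1) candidate patterns: 3 for two bridges, 15 for three, 105 for four, which is why the choice among them is left to a learned predictor rather than to chance.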


Example of cysteine connectivity pattern.


Here at the Machine Learning and Neural Networks Group, we have developed predictors for both aspects of the problem. We created a two-stage architecture which uses Support Vector Machines for the prediction of cysteine bonding state [3]. Then, we employed Generic Recursive Neural Networks to predict the pattern of connectivity of bonded cysteines [4]. Both predictors outperform all the current competitors. We also built an automated server that combines these predictors into the first available tool for predicting disulfide bonds in proteins. The server is accessible via a web interface.



Fine Grained Contact Maps


The 3D structure of a protein can be captured to a large extent by its distance map. A distance map is a two-dimensional symmetric matrix which contains the distances between pairs of elements of the protein. Different resolutions can be defined depending on the level of structure we use: from atoms, to amino acids, to segments of secondary structure. For amino acids, distances are usually calculated between their C-alpha atoms.
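A minimal Python sketch of an amino acid level distance map follows. It assumes the C-alpha coordinates are already available as (x, y, z) tuples in angstroms, for example parsed from a PDB file; the function name distance_map and the coordinates are illustrative only. It requires Python 3.8 or later for math.dist.

```python
import math


def distance_map(ca_coords):
    """Return the symmetric matrix of Euclidean distances between
    all pairs of C-alpha atoms, given one (x, y, z) tuple per residue."""
    n = len(ca_coords)
    return [
        [math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up coordinates for three residues.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
for row in distance_map(coords):
    print(row)
```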


From distance matrix to contact maps (protein 1auga).


Distance maps are not easily handled by machine learning methods, because their prediction would be a very difficult regression task. Therefore, contact maps are created by applying a cutoff to the distances: two elements are considered in contact when their distance is below the cutoff. A fine-grained contact map is defined using distances between amino acids. The predicted contact map can then be used to reconstruct the protein's tertiary structure [5].
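The thresholding step can be sketched in a few lines of Python. The 8 angstrom default cutoff is an assumption chosen for illustration, since the text does not state which value is used here; the function name contact_map and the distances are likewise hypothetical.

```python
def contact_map(distances, cutoff=8.0):
    """Turn a distance matrix into a binary contact map.

    Entry (i, j) is 1 when the distance between elements i and j is
    strictly below the cutoff, and 0 otherwise. The diagonal is
    therefore always 1 for any positive cutoff.
    """
    return [[1 if d < cutoff else 0 for d in row] for row in distances]


# Made-up distances (in angstroms) between three residues.
distances = [
    [0.0, 3.8, 10.2],
    [3.8, 0.0, 6.5],
    [10.2, 6.5, 0.0],
]
for row in contact_map(distances):
    print(row)
```

Thresholding turns the regression problem on real-valued distances into a binary classification problem on pairs of residues.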

We are currently studying the problem of fine-grained contact map prediction. We want to investigate the connections between contact maps and secondary structure in order to improve both predictions. Moreover, we are investigating the possibility of an optimal reconstruction of the protein's 3D structure from predicted contact maps.



Coarse Grained Contact Maps


The prediction of fine-grained contact maps (see Fine Grained Contact Maps) poses several problems from a machine learning point of view: training sets are huge and highly unbalanced. Moreover, whether two amino acids are in contact depends on the whole spatial configuration of the protein. Therefore, the contact state of a pair of amino acids heavily depends on the global information contained in the chain. Unfortunately, no machine learning technique seems able to efficiently use the information contained in such long sequences [6].
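The size and imbalance of the training set can be made concrete with a small counting sketch. It assumes each unordered pair of residues is one training example and that the contact map is a symmetric 0/1 matrix; the function name pair_statistics, the min_separation parameter and the toy map are illustrative assumptions.

```python
def pair_statistics(contacts, min_separation=1):
    """Count training examples and positive examples in a contact map.

    Only pairs (i, j) with j - i >= min_separation are considered.
    Returns (number_of_pairs, number_of_contacts).
    """
    n = len(contacts)
    pairs = 0
    positives = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            pairs += 1
            positives += contacts[i][j]
    return pairs, positives


# Toy map for a 4-residue chain in which only neighbours touch.
toy_map = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
print(pair_statistics(toy_map))                    # all pairs
print(pair_statistics(toy_map, min_separation=2))  # skip chain neighbours
```

A chain of n residues yields n(n-1)/2 pairs, so the number of examples grows quadratically with the length of the protein.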

Lower-resolution contact maps can be used instead. For this purpose, coarse-grained contact maps are defined using contacts between secondary structure elements. In this way, the task of predicting the tertiary structure of a protein is split into two steps: first a reliable predictor of secondary structure must be built (see Secondary Structure), then a predictor for the coarse-grained contact map can be used to identify contacts between the segments.
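The following Python sketch shows one way a coarse-grained map could be derived from a fine-grained one. The contact rule used here, two segments touch when at least one residue of the first is in contact with at least one residue of the second, is an assumption made for illustration; the text does not specify the exact definition. The function name coarse_contact_map and the toy data are hypothetical.

```python
def coarse_contact_map(contacts, segments):
    """Derive a segment-level contact map from a residue-level one.

    contacts: symmetric 0/1 matrix over residues.
    segments: list of (first_index, last_index) pairs, 0-based inclusive.
    Assumed rule: two distinct segments are in contact when any pair of
    their residues is in contact. The diagonal is left at 0.
    """
    m = len(segments)
    coarse = [[0] * m for _ in range(m)]
    for a, (start_a, end_a) in enumerate(segments):
        for b, (start_b, end_b) in enumerate(segments):
            if a == b:
                continue
            if any(
                contacts[i][j]
                for i in range(start_a, end_a + 1)
                for j in range(start_b, end_b + 1)
            ):
                coarse[a][b] = 1
    return coarse


# Six residues in three segments; residues 1-2 and 0-5 are in contact.
residue_contacts = [[0] * 6 for _ in range(6)]
for i, j in [(1, 2), (0, 5)]:
    residue_contacts[i][j] = residue_contacts[j][i] = 1

for row in coarse_contact_map(residue_contacts, [(0, 1), (2, 3), (4, 5)]):
    print(row)
```

The coarse map has one row per segment rather than one per residue, which makes it much smaller than the fine-grained map.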

We focused on the second step of the solution, studying the various aspects of coarse-grained contact map prediction. We then employed generic recursive neural networks to build an efficient predictor, which proves to be a valid step in the direction of ab initio tertiary structure prediction [7].



References


  1. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman.  Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.  Nucleic Acids Research, 25:3389--3402, 1997.
  2. A. Ceroni, P. Frasconi, A. Passerini and A. Vullo.  A Combination of Support Vector Machines and Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction.  NNMLg Technical Report, 2003.
  3. P. Frasconi, A. Passerini and A. Vullo.  A Two-Stage SVM Architecture for Predicting the Disulfide Bonding State of Cysteines.  Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 25--34, 2002.
  4. A. Vullo and P. Frasconi.  A Recursive Connectionist Approach for Predicting Disulfide Connectivity in Proteins.  To appear in the Proceedings of the Eighteenth Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne, FL.
  5. M. Vendruscolo, E. Kussell and E. Domany.  Recovery of Protein Structure from Contact Maps.  Folding and Design, 2:295--306, 1997.
  6. Y. Bengio, P. Simard and P. Frasconi.  Learning Long-Term Dependencies with Gradient Descent is Difficult.  IEEE Transactions on Neural Networks, 5(2):157--166, 1994.
  7. A. Vullo and P. Frasconi.  A Bi-Recursive Neural Network Architecture for the Prediction of Protein Coarse Contact Maps.  Proceedings of the 1st IEEE Computer Society Bioinformatics Conference, Stanford Univ., August 2002.

Copyright notice

The documents listed in this site are provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


8th July 2003. Machine Learning and Neural Networks Group. For questions and comments: aceroni@dsi.unifi.it.