Protein Structure Prediction
Introduction
Proteins are polypeptide chains that carry out most of the basic
functions of life at the molecular level. These linear chains fold into
complex 3D structures whose shape is responsible for the proteins'
behavior. Each link of the chain is one of the 20 standard
amino acids found in nature; their order is the protein's primary structure.
Proteins are synthesized inside cells. The instructions used by the
cell to build a protein are written in the DNA. DNA's double helix is
made of two long chains composed of four different nucleotides. The
DNA contains genes, sequences of nucleotides that code for
proteins. Each triplet of nucleotides in a gene corresponds to an amino
acid of the encoded protein.
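The triplet-to-amino-acid correspondence can be illustrated with a minimal sketch. The codon table below is a small excerpt of the standard genetic code, chosen only to cover the toy sequence; the function name translate and the example gene are hypothetical, and the sketch assumes the input is already the coding strand read in frame.

```python
# Excerpt of the standard genetic code, just enough for the toy example.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCT": "A",  # alanine
    "TGT": "C",  # cysteine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def translate(gene: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes."""
    protein = []
    # Read the sequence three nucleotides at a time, ignoring a trailing partial codon.
    for i in range(0, len(gene) - len(gene) % 3, 3):
        codon = gene[i:i + 3].upper()
        residue = CODON_TABLE[codon]  # KeyError for codons missing from this excerpt
        if residue == "*":
            break  # a stop codon ends the protein
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCTTGTAAATGGTAA"))  # MACKW
```

The returned string is the primary structure in one-letter code. A real translation would use the full 64-codon table.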

Example of protein 3D structure (2bnh00 horse-shoe domain from CATH).
Thanks to several genome sequencing projects, the entire DNA sequence
of many organisms has been experimentally determined.
Inside each genome, the positions of genes have been located using
specific signals: particular sequences of nucleotides used by cells
during transcription. From these identified genes, the proteins'
primary sequences have been extracted. Unfortunately, our knowledge often
stops here. The proteins' 3D (tertiary) structure, essential for
studying their functions, remains largely unknown. There exist physical
methods to estimate the coordinates of each atom of a protein, but
they need the protein to be crystallized, a time-consuming process
which, at the moment, cannot be automated or scaled up. Although
the number of proteins whose primary sequence is known runs into the
millions, only a few thousand of them have been crystallized
and had their 3D structure deposited in the
Protein Data Bank.
Unfortunately, alternative approaches based
on nuclear magnetic resonance cannot be applied at the genomic
scale either. It is therefore becoming increasingly important to
predict a protein's tertiary structure ab initio from its
amino acid sequence, using insights obtained from already known
structures.
Secondary Structure
Proteins present local regularities in their 3D structure, formed and maintained
by hydrogen bonds between atoms. These regular structures are referred to as the
protein's secondary structure. The most common configurations observed
in proteins are called alpha helices and beta strands, while
all other conformations are referred to as coils. A group of adjacent
amino acids sharing the same conformation forms a segment of
secondary structure. Segments of secondary structure are well-defined and stable
aggregations of amino acids which strongly influence the chain's
folding and which usually carry out specific functions inside the
protein, like a sequence of words in a particular language forming a
meaningful phrase.
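As a minimal sketch of how segments arise from per-residue conformations, the snippet below groups a secondary structure string into runs of identical states. The three-letter alphabet (H for helix, E for strand, C for coil) is a common convention assumed here, and the function name segments and the example string are hypothetical.

```python
from itertools import groupby


def segments(ss: str):
    """Return (state, start, end) for each run of identical states; end is exclusive."""
    result = []
    position = 0
    for state, run in groupby(ss):
        length = len(list(run))
        result.append((state, position, position + length))
        position += length
    return result


# One state per amino acid: H = alpha helix, E = beta strand, C = coil.
for state, start, end in segments("CCHHHHHCCEEEEC"):
    print(state, start, end)
```

For the toy string this yields five segments: a coil, a helix over residues 2 to 6, a coil, a strand over residues 9 to 12, and a final one-residue coil.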

Schematic representations of secondary structure segments.
Reliable predictors of the secondary structure of
a protein are fundamental to the study of its folding and
functions. Threading algorithms, which attempt to
determine the fold of unknown proteins, use predicted
secondary structure sequences to search
databases of known folds. Moreover, the predicted
secondary structure content of a protein can be
used to identify its folding family
(CATH,
SCOP) and
thus estimate its functions.
Here at the Machine Learning and Neural Network
Group, we have developed our own predictor of
secondary structure. We created a two-stage
architecture which uses multiple alignments
[1] and neural networks, and
it is capable of state-of-the-art performance as
measured on the PDB select dataset. For additional
information on the subject, read our technical
report [2].
The prediction server can be accessed from
this page.
Cysteines Bonding State and Connectivity
Cysteine is one of the twenty amino acids that constitute proteins.
The oxidized form of cysteine plays a fundamental
role in stabilizing the native conformation of
proteins. The covalent bonds formed by cysteines, known as disulfide
bridges, may connect very distant portions of the sequence.
The location of these bonds is a very informative constraint on
the conformational space, and knowing it represents a
significant step towards folding or understanding structural
properties of the protein. Prediction of disulfide bridges
from sequence is thus one of the important (and difficult) tasks in
structural genomics. Recent work in this area suggests methodologies based on
two steps. First, the disulfide-bonding state of each cysteine is predicted
(a binary classification problem). Subsequently, once the bonded cysteines are known,
other algorithms can be used to predict the actual location of disulfide bridges.
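To give a feeling for the second step, the sketch below enumerates every candidate connectivity pattern for a set of cysteines already predicted as bonded. It is only an illustration of the search space, not the predictor described on this page. It assumes that every bonded cysteine pairs with another cysteine in the same chain, so their number is even; the function name and the sequence positions are hypothetical.

```python
def connectivity_patterns(cysteines):
    """Yield every way of pairing up an even number of bonded cysteines."""
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    # Pair the first cysteine with each possible partner, then recurse on the others.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pattern in connectivity_patterns(remaining):
            yield [(first, partner)] + pattern


bonded = [5, 30, 55, 110]  # hypothetical sequence positions of bonded cysteines
for pattern in connectivity_patterns(bonded):
    print(pattern)
```

Four bonded cysteines admit 3 patterns, six admit 15, and eight admit 105, so the number of candidates grows quickly with the number of bridges.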

Example of cysteine connectivity pattern.
Here at the Machine Learning and Neural Network Group, we have developed
predictors for both aspects of the problem. We created a two-stage architecture
which uses Support Vector Machines for the prediction of the cysteine bonding state
[3]. Then, we employed Generic Recursive Neural Networks
to predict the connectivity pattern of bonded cysteines [4].
Both predictors outperform all the current competitors. We also built an automated
server that combines these predictors into the first available tool for predicting
disulfide bonds in proteins. The server is accessible via a
web interface.
Fine Grained Contact Map
The 3D structure of a protein can be captured to a large extent by its distance
map. A distance map is a two-dimensional symmetric matrix which contains the distances
between pairs of elements of the protein. Different resolutions can be defined
depending on the level of structure we use: from atoms, to amino acids, to
segments of secondary structure. In the case of amino acids, distances are usually
calculated between their C-alpha atoms.
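A minimal sketch of an amino-acid-level distance map follows, assuming the C-alpha coordinates are already available as (x, y, z) tuples in angstroms. The function name distance_map and the three coordinates are made up for illustration and do not come from a real structure.

```python
import math


def distance_map(ca_coords):
    """Return the symmetric matrix of Euclidean distances between C-alpha atoms."""
    n = len(ca_coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])  # requires Python 3.8+
            dmap[i][j] = dmap[j][i] = d  # fill both halves: the map is symmetric
    return dmap


# Hypothetical C-alpha coordinates for a three-residue fragment.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
for row in distance_map(coords):
    print([round(d, 2) for d in row])
```

Only the upper triangle is computed explicitly; the diagonal stays at zero because each residue is at distance zero from itself.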
 
From distance matrix to contact maps (protein 1auga).
Distance maps are not easily handled by machine learning methods, because their
prediction would be a very difficult regression task. Therefore, contact maps are
created by applying a cutoff to the distances: two elements are in contact when their
distance falls below the cutoff. A fine-grained contact map
is defined using distances between amino acids. The predicted contact map can then be
used to reconstruct the protein's tertiary structure [5].
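Turning a distance map into a contact map amounts to thresholding, as the sketch below shows. The 8 angstrom default is an assumption made for this example (the text does not state which cutoff is used), and both the function name contact_map and the toy distances are hypothetical.

```python
def contact_map(dmap, cutoff=8.0):
    """Binarize a distance map: 1 where the distance is within the cutoff, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in dmap]


# Hypothetical distances (in angstroms) between three amino acids.
toy_distances = [
    [0.0, 3.8, 9.5],
    [3.8, 0.0, 6.1],
    [9.5, 6.1, 0.0],
]
print(contact_map(toy_distances))  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
```

The regression target (a real-valued distance) becomes a binary label per pair, which is the form a classifier can be trained on. The diagonal is trivially 1 here; a real pipeline might exclude it along with near-neighbors in the sequence.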
We are currently studying the problem of fine-grained contact map prediction.
We want to investigate the connections between contact maps and secondary structure
in order to improve both predictions. Moreover, we are investigating the possibility
of an optimal reconstruction of the protein 3D structure from predicted contact maps.
Coarse Grained Contact Map
The prediction of fine-grained contact maps (see Fine Grained Contact Map above) poses
several problems from a machine learning point of view: training sets
are huge and highly unbalanced. Moreover, whether two amino acids
are in contact depends on the whole spatial configuration of the protein.
Therefore, the contact state of a pair of amino acids heavily depends on
the global information contained in the chain. Unfortunately, no machine
learning technique seems capable of efficiently using the information contained
in such long sequences [6].
Lower-resolution contact maps can be used instead. For this purpose, coarse-grained
contact maps are defined using contacts between secondary structure elements.
In this way, the task of predicting the tertiary structure of a protein is split
into two steps: first, a reliable predictor of secondary structure must be
built (see Secondary Structure above); then, a predictor for the coarse-grained
contact map can be used to identify contacts between the segments.
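The relation between the two resolutions can be sketched as follows. The rule used here, that two segments are in contact when at least one pair of their amino acids is in contact, is an assumption made for illustration; the text does not specify how segment contacts are defined. The function name, the segment boundaries, and the toy contacts are hypothetical.

```python
def coarse_contact_map(fine_map, segments):
    """Derive a segment-level contact map from a residue-level one.

    segments is a list of (start, end) residue ranges, end exclusive.
    Assumed rule: two segments touch if any residue pair between them touches.
    """
    m = len(segments)
    coarse = [[0] * m for _ in range(m)]
    for a, (start_a, end_a) in enumerate(segments):
        for b, (start_b, end_b) in enumerate(segments):
            if a == b:
                continue  # contacts inside a single segment are not of interest
            if any(fine_map[i][j]
                   for i in range(start_a, end_a)
                   for j in range(start_b, end_b)):
                coarse[a][b] = 1
    return coarse


# Toy fine-grained map for six residues with two hypothetical contacts.
n = 6
fine = [[0] * n for _ in range(n)]
for i, j in [(0, 3), (1, 4)]:
    fine[i][j] = fine[j][i] = 1

print(coarse_contact_map(fine, [(0, 2), (2, 4), (4, 6)]))
# [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```

The coarse map has one row per segment rather than one per amino acid, so it is much smaller than the fine-grained map it summarizes.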
We focused on the second step of the solution, studying the various aspects
of coarse-grained contact map prediction. We then employed generic recursive neural
networks to build an efficient predictor, which proves to be a valid step in
the direction of ab initio tertiary structure prediction [7].
References
[1] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman.
Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.
Nucleic Acids Research, 25:3389--3402, 1997.
[2] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo.
A Combination of Support Vector Machines and Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction.
NNMLg Technical report, 2003.
[3] P. Frasconi, A. Passerini and A. Vullo.
A Two-Stage SVM Architecture for Predicting the Disulfide Bonding State of Cysteines.
Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 25--34, 2002.
[4] A. Vullo and P. Frasconi.
A Recursive Connectionist Approach for Predicting Disulfide Connectivity in Proteins.
To appear in the Proceedings of the Eighteenth Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne, FL.
[5] M. Vendruscolo, E. Kussell and E. Domany.
Recovery of Protein Structure from Contact Maps.
Folding and Design, 2:295--306, 1997.
[6] Y. Bengio, P. Simard and P. Frasconi.
Learning Long-Term Dependencies with Gradient Descent is Difficult.
IEEE Transactions on Neural Networks, 5(2):157--166, 1994.
[7] A. Vullo and P. Frasconi.
A Bi-Recursive Neural Network Architecture for the Prediction of Protein Coarse Contact Maps.
Proceedings of the 1st IEEE Computer Society Bioinformatics Conference, Stanford Univ., August 2002.
Copyright notice
The documents listed in this site are provided as a means to ensure timely dissemination
of scholarly and technical work on a noncommercial basis. Copyright and all rights therein
are maintained by the authors or by other copyright holders, notwithstanding that they have
offered their works here electronically. It is understood that all persons copying this
information will adhere to the terms and constraints invoked by each author's copyright.
These works may not be reposted without the explicit permission of the copyright holder.
8th July 2003. Machine Learning and Neural Networks Group.
For questions and comments: aceroni@dsi.unifi.it.