Document Analysis group

Dipartimento di Sistemi e Informatica
University of Florence
Via Santa Marta 3
50139 Firenze - Italy

Research and Publications

People


Simone Marinai, Emanuele Marino, Beatrice Miotti, Giovanni Soda


Introduction


The activity of the Dante research group began in 1993 with the first applications of unconstrained handwritten character recognition by means of artificial neural networks. In the following years we addressed other application domains, including form processing, layout analysis, digital libraries, and, more recently, document image retrieval.
One of the main peculiarities of our research is the use of Artificial Neural Networks in several application domains. The interaction between Artificial Neural Networks and Document Image Analysis was the subject of a tutorial, Artificial Neural Networks for Document Analysis and Recognition, held at ICDAR 2001 and ICPR 2002, and is described in a survey paper [PAMI05].



Document Image Retrieval


The traditional approach in Document Image Analysis aims at a complete and accurate extraction of the informative content of document images. This strategy is appropriate only for small collections and when the data have significant commercial value. This is not the case for Digital Libraries, where different strategies can be considered.

One strategy is to adopt document image retrieval based on layout similarity. In this approach the user identifies one page in the database, and the most similar pages are then identified by the system and shown to the user. In [DAS02] we described a system for layout-based Document Image Retrieval where pages are represented by means of MXY trees encoded with a suitable representation. Recent investigations have considered the use of tree transformation rules in order to improve document retrieval [ICDAR05] [AvivDlib05].

A DIAR system cannot ignore the textual content of the pages. This point of view is the subject of a paper [PAMI06] where we described a system for document retrieval on the basis of keywords provided by the user. One salient feature of the proposed approach is that indexing is independent of the specific language and font of the stored documents.

The methods for layout-based and textual-based document retrieval have been integrated into a single system that allows users to retrieve relevant documents combining the two basic approaches in several ways [DIAL04].



Layout analysis and page classification


The segmentation of document images is aimed at identifying regions having a homogeneous content that can be subsequently processed with appropriate techniques. Methods based on XY trees are well known and are based on a recursive segmentation of the page along white spaces that span the whole region being split. With the aim of processing pages containing horizontal and vertical ruling lines, we proposed in [ICDAR99a] the MXY tree (Modified X-Y tree) segmentation algorithm. MXY trees can be used both for page segmentation and for document page classification.

In the STRETCH (STorage and RETrieval by content of imaged documents) European Project we developed a system for the storage and retrieval of commercial invoices that is based on the extraction of a suitable symbolic description of the invoice structure by means of MXY trees [IJDAR02].

We recently extended our research towards information extraction from documents (books and journals) belonging to Digital Libraries. The DSI participation in the METAe (the Metadata Engine) project was mainly devoted to the classification of book and journal pages by means of artificial neural networks operating on the MXY tree page representation [DEXA01].

MXY trees have also been used for table location in technical papers [ICPR02]. Another line of work uses tree grammars to expand the training set and thereby improve page classification performance [ICDAR03b].



Logo Recognition


Logo recognition is a useful tool for document identification (for instance, for the classification of commercial invoices). Unlike character recognition, in logo recognition the number of classes is not fixed and can change dynamically. It is therefore appropriate to build modular classifiers, like the one proposed in [GREC97], which is based on autoassociators.
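
The reason a modular design fits a changing set of classes can be illustrated with a small sketch: one autoassociator is trained per class, an input is assigned to the class whose autoassociator reconstructs it best, and a new class is added without touching the existing models. The code below is only a stand-in under stated assumptions: it uses a linear autoassociator with a single hidden unit (fitted by power iteration) instead of a trained neural network, and the names `LinearAutoassociator` and `ModularClassifier` are hypothetical, not taken from the cited papers.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


class LinearAutoassociator:
    """Hypothetical one-hidden-unit linear autoassociator for one class.

    It reconstructs an input from its projection on the main direction of
    variation of the class samples.
    """

    def __init__(self, samples: List[Vector], iterations: int = 50) -> None:
        n, dim = len(samples), len(samples[0])
        self.mean = [sum(s[i] for s in samples) / n for i in range(dim)]
        centred = [[s[i] - self.mean[i] for i in range(dim)] for s in samples]
        # Power iteration on the (implicit) covariance matrix. The asymmetric
        # starting vector makes an orthogonal start unlikely, not impossible.
        direction = [float(i + 1) for i in range(dim)]
        for _ in range(iterations):
            scores = [sum(c[i] * direction[i] for i in range(dim)) for c in centred]
            direction = [sum(sc * c[i] for sc, c in zip(scores, centred))
                         for i in range(dim)]
            norm = math.sqrt(sum(d * d for d in direction))
            if norm == 0.0:
                break  # degenerate case: reconstruct with the class mean only
            direction = [d / norm for d in direction]
        self.direction = direction

    def reconstruction_error(self, x: Vector) -> float:
        """Squared distance between `x` and its reconstruction."""
        centred = [xi - mi for xi, mi in zip(x, self.mean)]
        score = sum(c * d for c, d in zip(centred, self.direction))
        residual = [c - score * d for c, d in zip(centred, self.direction)]
        return sum(r * r for r in residual)


class ModularClassifier:
    """One autoassociator per class; classes can be added at any time."""

    def __init__(self) -> None:
        self.models: Dict[str, LinearAutoassociator] = {}

    def add_class(self, label: str, samples: List[Vector]) -> None:
        # Only the new module is trained; existing modules are left untouched.
        self.models[label] = LinearAutoassociator(samples)

    def classify(self, x: Vector) -> str:
        return min(self.models,
                   key=lambda label: self.models[label].reconstruction_error(x))


# Toy 2-D "logos": class A varies along the x axis, class B along the y axis.
classifier = ModularClassifier()
classifier.add_class("A", [(0, 0), (1, 0), (2, 0), (3, 0)])
classifier.add_class("B", [(5, 0), (5, 1), (5, 2), (5, 3)])
first = classifier.classify((1.5, 0.1))

# A new class arrives later: no retraining of A or B is needed.
classifier.add_class("C", [(0, 10), (1, 11), (2, 12), (3, 13)])
second = classifier.classify((1.5, 11.4))
```

A monolithic classifier with one output per class would instead have to be retrained whenever a class is added or removed.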

Autoassociators have also been used for the recognition of rotated and noisy logos [PR03]. In this case we modified the training algorithm of the autoassociators so as to take into account the information corresponding to the logo contour and to reduce the effect of blobs of noise (e.g. black or white stripes).



Form and invoice reading systems


The main difficulties in form processing are document registration and the identification of information fields that are not placed in fixed positions on the page. A new model for processing variable-layout forms was proposed in [ICDAR95] and [DEXA95], where we also discussed suitable algorithms for document registration and information field location. The proposed approach has been demonstrated in a running system for the description and analysis of the layout of structured documents [PAMI98].

Similar techniques have also been applied to the reading of commercial invoices [DEXA97] and to the semi-automatic labeling of columns in invoices [ICDAR97b].



Printed and handwritten character recognition


Key components of most document processing systems are the modules devoted to the interpretation of handwritten and printed text. In this context we proposed a noise model [GREC97] that can be applied to grey-level images in order to generate synthetic patterns to be used for classifier training.

Modular classifiers are frequently used to improve on the performance of individual classifiers. We developed techniques for the serial combination of neural classifiers within an OCR system [IJDAR01]. The approach consists of a preliminary classification by an MLP (MultiLayer Perceptron), followed by a refinement performed by autoassociator-based classifiers that are selected according to the confidence of the MLP.
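
The control flow of such a serial combination can be sketched as follows. This is an illustration under assumptions, not the procedure of the cited paper: the acceptance threshold, the choice of refining among the `top_k` classes ranked by the MLP, and all names (`serial_classify`, `toy_mlp`, `toy_errors`) are hypothetical. Both stages are passed in as plain callables so that the sketch stays independent of any neural network library.

```python
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
Scores = Dict[str, float]


def serial_classify(
    x: Vector,
    mlp: Callable[[Vector], Scores],
    autoassociator_error: Dict[str, Callable[[Vector], float]],
    accept_threshold: float = 0.9,  # assumed value, to be tuned on real data
    top_k: int = 2,                 # assumed number of candidates to refine
) -> str:
    """Stage 1: MLP confidences. Stage 2: refine among the best candidates."""
    scores = mlp(x)
    ranked = sorted(scores, key=scores.get, reverse=True)
    if scores[ranked[0]] >= accept_threshold:
        return ranked[0]  # the MLP is confident: accept its answer
    # The MLP confidences select which autoassociators are consulted.
    candidates = ranked[:top_k]
    return min(candidates, key=lambda label: autoassociator_error[label](x))


def toy_mlp(x: Vector) -> Scores:
    """Stand-in for a trained MLP: confident only when x[0] is small."""
    if x[0] < 0.5:
        return {"O": 0.95, "0": 0.03, "Q": 0.02}
    return {"O": 0.48, "0": 0.45, "Q": 0.07}


# Stand-ins for the reconstruction errors of per-class autoassociators.
toy_errors = {
    "O": lambda x: abs(x[1] - 0.0),
    "0": lambda x: abs(x[1] - 1.0),
    "Q": lambda x: 0.0,  # never consulted below: "Q" is not a top candidate
}

confident = serial_classify([0.1, 1.0], toy_mlp, toy_errors)  # accepted at stage 1
refined = serial_classify([0.9, 0.8], toy_mlp, toy_errors)    # decided at stage 2
```

In the second call the MLP cannot separate "O" from "0", so the decision is left to the two corresponding autoassociators and the one with the smaller reconstruction error wins.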


Links


o  INFORMys  Demo of the program. 
o  INFORMys  Data available. 

		  


References


[ICIAP09]
S. Marinai, E. Marino, G. Soda, Nonlinear Embedded Map Projection for Dimensionality Reduction, Proc. ICIAP 09, Springer Verlag, 2009.
[ICDAR09a]
S. Marinai, Metadata Extraction from PDF Papers for Digital Library Ingest, Proc. ICDAR 2009, IEEE, pp. 251-255, 2009.
[ICDAR09b]
S. Marinai, B. Miotti, G. Soda, Mathematical Symbol Indexing Using Topologically Ordered Clusters of Shape Contexts, Proc. ICDAR 2009, IEEE, pp. 1041-1045, 2009.
[AND09]
S. Marinai, Text retrieval from early printed books, Proc. AND Workshop 2009, ACM, pp. 33-34, 2009.
[SPR08]
S. Marinai, E. Marino, G. Soda, Embedded map projection for dimensionality reduction based similarity search, Proc. S+SSPR 2008, Springer Verlag, 2008.
[DAS08]
S. Marinai, E. Marino, G. Soda, A comparison of clustering methods for word image indexing, Proc. DAS 2008, Springer Verlag, 2008.
[MLDAR08a]
S. Marinai, Introduction to Document Analysis and Recognition, In Machine Learning in Document Analysis and Recognition, Studies in Computational Intelligence 90, Ed. Simone Marinai, Hiromichi Fujisawa, Springer Verlag, 2008.
[MLDAR08b]
S. Marinai, E. Marino, G. Soda, Self-Organizing Maps for Clustering in Document Image Analysis, In Machine Learning in Document Analysis and Recognition, Studies in Computational Intelligence 90, Ed. Simone Marinai, Hiromichi Fujisawa, Springer Verlag, 2008.
[ICIAP07]
S. Marinai, E. Marino, G. Soda.  Transformation invariant SOM clustering in Document Image Analysis.  14th International Conference on Image Analysis and Processing, Modena (Italy), 2007, IEEE Press, pp. 185-190.
[ECDL07]
S. Marinai, E. Marino, G. Soda.  Exploring Digital Libraries with Document Image Retrieval.  11th European Conference on Research and Advanced Technology for Digital Libraries, Budapest (Hungary), 2007, Springer Verlag, pp. 368-379.
[MYS07]
S. Marinai.  SOM clustering for text retrieval and classification with examples on Indian scripts.  Proc. of Brainstorming Workshop on OCR for Indian Languages 16-17 March, 2007, Mysore (India).
Invited talk
[PAMI06]
S. Marinai, M.Gori, G.Soda, Font Adaptive Word Indexing of Modern Printed Documents, IEEE Transaction PAMI, vol 28, N. 8, August 2006, pp. 1187-1199, IEEE Press, Los Alamitos (CA).
[CIFED06]
S. Marinai.  A survey of document image retrieval in digital libraries.  9th Colloque International Francophone sur l'Ecrit et le Document (CIFED 2006), pag. 193-198.
Invited talk
[DAS06]
S. Marinai, S. Faini, E. Marino, G. Soda.  Efficient word retrieval by means of SOM clustering and PCA.  7th International Workshop on Document Analysis Systems, Nelson (New Zealand), 2006, LNCS: pp.
[DIAL06]
S. Marinai, E. Marino, G. Soda, Tree clustering for layout-based document image retrieval, Proceedings of the Second Int'l Workshop on Document Image Analysis for Libraries, pp. 243-251, Lyon (France), 2006, IEEE Press, Los Alamitos (CA).
[PAMI05]
S. Marinai, M.Gori, G.Soda, Artificial Neural Networks for Document Analysis and Recognition, IEEE Transaction PAMI, vol 27, N. 1, January 2005, pp. 23-35, IEEE Press, Los Alamitos (CA).
[ICDAR05]
S. Marinai, E. Marino, G. Soda.  Layout based document image retrieval by means of XY tree reduction.  9th International Conference on Document Analysis and Recognition, Seoul (Korea), 2005, IEEE Press, pp. 432-436.
[NNLDAR05]
S. Faini, S. Marinai, E. Marino, G. Soda, SOM-based Document Image Retrieval, Proceeding of the 1st International IAPR Workshop on Neural Networks and Learning in Document Analysis and Recognition, pp. 33 -- 40, Seoul (Korea), 2005.
[AvivDlib05]
S. Marinai, E. Marino, G. Soda, Layout based document image retrieval in Digital Libraries, Proceeding of the 7th Int. Workshop Audio-Visual Content and Information Visualization in Digital Libraries (AVIVDiLib '05), Cortona (Italy), 2005 pp.67-76.
[DIAL04]
S. Marinai, E. Marino, F. Cesarini, G. Soda, A general system for the retrieval of document images from digital libraries, Proceedings of the First Int'l Workshop on Document Image Analysis for Libraries, pp. 150-173, Palo Alto (CA), 2004, IEEE Press, Los Alamitos (CA).
[PR03]
M. Gori, M. Maggini, S. Marinai, J. Q. Sheng, G. Soda, Edge-Backpropagation for Noisy Logo Recognition, Pattern Recognition, vol 36, N.1, 2003, pp. 103-110, Elsevier, Amsterdam (NL).
[ICDAR03a]
S. Marinai, E. Marino, G. Soda, Indexing and Retrieval of Words in Old Documents, Proceedings of ICDAR 2003, pp. 223-227, 2003, IEEE Press, Los Alamitos (CA).
This paper won the Best Paper Award at ICDAR 2003.
[ICDAR03b]
S. Baldi, S. Marinai, G. Soda, Using tree grammars for training set expansion in page classification, Proceedings of ICDAR 2003, pp. 829-833, 2003, IEEE Press, Los Alamitos (CA).
[IJDAR02]
E. Appiani, F. Cesarini, A.M. Colla, M. Diligenti, M.Gori, S.Marinai, G.Soda, Automatic document classification and indexing in high-volume applications, IJDAR, vol 4, N. 2 2001, pp. 69-83, Springer-Verlag, Berlin (D).
[ICPR02]
F. Cesarini, S. Marinai, L. Sarti, G. Soda, Trainable table location in document images, Proceedings of the 16th ICPR, pp. 236-240, Québec City (Canada), August 2002, IEEE Press, Los Alamitos (CA).
[DAS02]
F. Cesarini, S. Marinai, G. Soda, Retrieval by layout similarity of documents represented by MXY trees, Proceedings of the 5th IAPR International Workshop on Document Analysis Systems (DAS), pp. 353-364, Princeton (NJ, USA), August 2002, LNCS 2423, Springer-Verlag, Berlin (D).
[IJDAR01]
E. Francesconi, M.Gori, S.Marinai, G.Soda, A serial combination of connectionist-based classifiers for OCR, IJDAR, vol 3, N. 3 2001, pp. 160-168, Springer-Verlag, Berlin (D).
[ICDAR01]
F. Cesarini, M. Lastri, S. Marinai, G. Soda, Encoding of modified X-Y trees for document classification, Proceedings of ICDAR 2001, pp. 1131-1136, Seattle (USA), 2001, IEEE Press, Los Alamitos (CA).
[DEXA01]
F. Cesarini, M. Lastri, S. Marinai, G. Soda, Page classification for meta-data extraction from digital collections, Proceedings of DEXA 2001, Munich (D), 2001, pp. 82-91, LNCS 2113, Springer-Verlag, Berlin (D).
[ICDAR99a]
F.Cesarini, M. Gori, S. Marinai, G. Soda, Structured Document Segmentation and Representation by the Modified X-Y Tree, Proceedings of ICDAR 1999, pp. 563-566, Bangalore (India), 1999, IEEE Press, Los Alamitos (CA).
[ICDAR99b]
S. Marinai, P. Nesi, Projection based Segmentation of Musical Sheets, Proceedings of ICDAR 1999, pp. 563-566, Bangalore (India), 1999, IEEE Press, Los Alamitos (CA).
[PAMI98]
F.Cesarini, M.Gori, S.Marinai, G.Soda, INFORMys: a flexible INvoice-like FORM reader system, IEEE Transaction PAMI, vol 20, N. 7 July 1998, pp. 730-745, IEEE Press, Los Alamitos (CA).
[GREC97]
E. Francesconi, P. Frasconi, M. Gori, S. Marinai, J.Q. Sheng, G. Soda, A. Sperduti, Logo Recognition by Recursive Neural Networks, in Graphics Recognition, Algorithms and Systems, LNCS (1389) pp. 104 - 117, 1998, Springer Verlag, Berlin (D).
[DEXA97]
F. Cesarini, E. Francesconi, M. Gori, S. Marinai, J.Q. Sheng, G. Soda, Conceptual Modelling for Invoice Document Processing, Proceedings of the Conference DEXA '97 Workshop on Query Processing in Multimedia Information System, Toulouse, September 1997, pp. 596-603, IEEE Press, Los Alamitos (CA).
[ICDAR97a]
F.Cesarini, E.Francesconi, M. Gori, S. Marinai, J.Q. Sheng, G. Soda, A Neural-based architecture for spot-noisy logo recognition, Proceedings of ICDAR 1997, pp. 175-179, Ulm (Germany), 1997, IEEE Press, Los Alamitos (CA).
[ICDAR97b]
F.Cesarini, E.Francesconi, M. Gori, S. Marinai, J.Q. Sheng, G. Soda, Rectangle labelling for an Invoice Understanding System, Proceedings of ICDAR 1997, pp. 324-330, Ulm (Germany), 1997, IEEE Press, Los Alamitos (CA).
[GREC95]
F. Cesarini, M. Gori, S. Marinai, G. Soda, A Hybrid System for Locating Low Level Graphic Items, in Graphics Recognition, Methods and Applications, LNCS (1072) pp. 135 - 147, 1996, Springer Verlag, Berlin (D).
[DEXA95]
F. Cesarini, M. Gori, S. Marinai, G. Soda, Data Extraction from Form Images, Proceedings of DEXA 1995, London (UK), 1995, pp. 438-448, LNCS 978, Springer-Verlag, Berlin (D).
[ICDAR95]
F. Cesarini, M. Gori, S. Marinai, G. Soda, A System for Data Extraction from Forms of Known Class, Proceedings of ICDAR 1995, pp. 1136-1140, Montreal, 1995, IEEE Press, Los Alamitos (CA).

Copyright notice

The documents listed in this site are provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


1st November 2007. Dante Group.