- Simone Marinai
Document image analysis and recognition (DIAR) is a research field that has its roots in the first Optical Character Recognition (OCR) systems, which were used to read numeric check codes. Today, DIAR technology is used in a broad range of applications where information has to be extracted from structured documents available in different media. Typical applications include handwritten character recognition, processing of textual web images, and information extraction from digital libraries. In the digital library community, much effort has been devoted to digitizing paper collections in order to archive them as document image collections. Large digital archives are currently available; however, they can be fully exploited only by accessing the information embedded in the digital images. Simply applying OCR packages solves these problems only partially, because of both the difficulty of obtaining clean converted text and the lack of a structural description of the document. To tackle these problems, either layout analysis methods or document image retrieval approaches can be considered.
This tutorial will provide a first introduction to the most important tasks in DIAR, from low-level document image processing to high-level applications (see the tutorial summary). Some applications in the digital library field will be described in more detail. The tutorial is supported by slides distributed to participants and by an extensive bibliography. In addition, where appropriate, commercial products and publicly available software for the described tasks will be discussed.
This introductory tutorial is addressed to researchers, students, and technical staff interested in an introduction to the problems, solutions, and research directions in this field. System integrators may appreciate the discussion of the features of commercial products used for document image processing and OCR, whereas researchers and students may be interested in pointers to the state of the art in research on the aspects common to DIAR and digital library applications. A general background in computer science is required; the most basic concepts of document imaging will be provided in the first part of the tutorial.
Scanning and storage
OCR and handwriting recognition
Document image retrieval
Digital library applications
Simone Marinai received the Laurea in Electronic Engineering in 1992,
from the University of Florence, Italy. He obtained the PhD degree
in computer science in 1996 with a thesis on the extraction of information
from structured documents.
In 1995 he was a visiting scientist at the CENPARMI lab (Concordia University, Montreal, Canada).
His main research interests are in pattern recognition, neural
networks, and document processing applications.
He is currently an Assistant Professor at the University of Florence,
where he teaches, among other topics, DIAR methods in the Artificial Intelligence course.
In 2001 he co-authored the tutorial "Artificial Neural Networks for Document
Analysis and Recognition", organized in conjunction with the Int. Conference
on Document Analysis and Recognition (Seattle, USA). The same tutorial will be organized
in conjunction with the next ICPR (Quebec City, Canada, Aug. 2002).
Simone Marinai is the technical representative of DSI in METAe ("The Metadata Engine Project"),
an EU-funded project
aimed at the automatic extraction of structural metadata from scanned documents
belonging to digital libraries.
He was the chairman of the workshop "Document Analysis and Understanding
for Document Databases" (DAUDD), held in 1999 in conjunction with the DEXA conference.
He is a member of several conference program committees and is currently an Associate
Editor of the journal "Electronic Letters on Computer Vision and Image Analysis".