ICDAR 2011 Tutorial

September 18th 2011

www.icdar2011.org




Ebooks: challenges and opportunities for Document Analysis research

 Simone Marinai


 

 

 

In recent years, interest in e-book readers has been growing, following the growth in sales of electronic books. Two main document formats are accepted by most devices: PDF and ePub. The PDF format is widely used to share documents because it can be read across platforms. However, it is not ideal for comfortable reading on small screens. By contrast, the ePub format is re-flowable and is well suited to e-book readers.

In this tutorial we analyze the challenges and opportunities for Document Analysis research with respect to these devices and document formats. In particular, we first describe the main features of dedicated e-book readers and of the various file formats supported by most devices. We will subsequently analyze the standard ePub format in more detail, with a hands-on demonstration of the most popular open source software for converting and editing ebooks.
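
As a rough illustration of what "looking inside" an ePub involves, the sketch below builds only the outer container of an ePub file in memory and reads back the location of the package document. It relies on the general container conventions (a ZIP archive whose first entry is an uncompressed `mimetype` file, plus `META-INF/container.xml`). The path `OEBPS/content.opf` and both function names are assumptions chosen for the example, and the result is a skeleton, not a complete or valid ebook.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# "OEBPS/content.opf" is a commonly seen but purely illustrative path (assumption).
CONTAINER_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="{CONTAINER_NS}">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_container_skeleton() -> bytes:
    """Return the bytes of a ZIP holding only the ePub container entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored without compression.
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def find_package_path(epub_bytes: bytes) -> str:
    """Read META-INF/container.xml and return the package document path."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find(f".//{{{CONTAINER_NS}}}rootfile")
    return rootfile.get("full-path")


if __name__ == "__main__":
    print(find_package_path(build_container_skeleton()))
```

A real ebook would add the package document itself, the XHTML content files, and a navigation file; the open source tools covered in the tutorial handle those details.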

In the second part we point out the problems faced by most tools when converting complex documents such as scientific and technical papers. We first analyze a system that we developed for converting PDF books to ePub. In this system we invert the text formatting applied during pagination. To this end, layout analysis techniques are applied at the book level in order to identify the book's table of contents and its main functional areas, such as chapters, paragraphs, and notes.
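
To give a flavour of what "inverting the formatting made during pagination" can mean at the smallest scale, here is a toy sketch that merges hard-wrapped lines back into paragraphs and rejoins words hyphenated at line ends. It is not the system described above: the function name `unwrap_lines` and the heuristic (a trailing hyphen followed by a lowercase letter marks a split word) are assumptions made for illustration only.

```python
import re


def unwrap_lines(lines: list[str]) -> list[str]:
    """Merge hard-wrapped lines into paragraphs.

    Blank lines are treated as paragraph separators. A line ending in
    "letter + hyphen" followed by a line starting with a lowercase letter
    is assumed to be a word split by the typesetter (a naive heuristic).
    """
    paragraphs: list[str] = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif re.search(r"[A-Za-z]-$", current) and line[0].islower():
            # Drop the line-end hyphen and glue the two word halves together.
            current = current[:-1] + line
        else:
            current += " " + line
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    page = [
        "Layout analysis tech-",
        "niques are applied at",
        "the book level.",
        "",
        "A second para-",
        "graph follows.",
    ]
    for paragraph in unwrap_lines(page):
        print(paragraph)
```

The heuristic would wrongly merge genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), and it ignores page headers, footnotes, and columns, which is why book-level layout analysis is needed in practice.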

In the last part we will address ongoing research on the conversion of scientific and technical documents, which are more difficult to handle. In particular, the presence of mathematical equations, tables, and illustrations in multi-column layouts requires the integration of document analysis techniques with information extraction algorithms. Techniques for layout analysis, graphical symbol recognition, mathematical expression analysis, and table understanding, among others, are relevant in this application area. We will also discuss open problems in properly displaying complex objects, such as tables and chemical drawings, on relatively small screens.

 

Detailed Outline

The tutorial will last half a day, with a coffee break between parts A and B.

 

Introduction

PART A

  1. Ebook devices.
  2. File formats for ebook devices: fixed layout vs. re-flowable layout.
  3. The ePub and PDF formats.
  4. Repositories of free ebooks.
  5. Open source tools for generating documents in ePub and converting them to it.

 

PART B

  1. Converting PDF books to ePub.
  2. Problems with scientific and technical documents:

·         Illustrations

·         Equations

·         Tables

  3. Problems with digitized documents where DIAR research can help.
  4. Other issues: handling annotations, retrieving books by visual features.

 

 

Intended audience:

Most Document Analysis researchers are likely to be interested in the topics covered in the tutorial. The tutorial will be introductory, so technical details such as file formats will be addressed only at a high level. A basic knowledge of Document Analysis techniques and of the main problems addressed by ICDAR research is useful, as is some basic knowledge of XML.

The tutorial topics are of broad interest to the ICDAR community. Some speculate that the physical book will be dead in a few years. Whether this becomes reality or turns out to be another version of the myth of the paperless world, it is important for DIAR researchers to know more about the details of these technologies.


About the presenter:

    Simone Marinai is currently an Assistant Professor at the University of Firenze, Italy. He received his PhD from the University of Firenze, Italy in 1996.

    Dr Marinai is editor in chief of the International Journal on Document Analysis and Recognition (IJDAR) and of the Electronic Letters on Computer Vision and Image Analysis (ELCVIA) journal; Chair of the Conferences and Meetings (C&M) committee of IAPR; Past chair of the IAPR Technical Committee on Neural Networks and Computational Intelligence (TC3); Steering Committee member of the DAS and DIAL workshops; Advisory board member of ICDAR; Publicity Co-Chair of ICDAR 2009; Co-chair of DAS 2004, ANNPR 2003, ANNPR 2006, ANNPR 2008, NNLDAR 2005, DAUDD 1999; Program Committee member of most current and recent editions of conferences in various research fields, including ICPR, ICDAR, DAS, and CFED. He is co-editor of the book "Machine Learning in Document Analysis and Recognition", published by Springer Verlag in 2008.

 

In March 2011 Simone Marinai was invited to give a presentation at the first Italian conference on Ebooks (EbookLab Italia).

In 2009 he gave, together with Apostolos Antonacopoulos, a tutorial at ICDAR on "Digital Libraries and Historical Document Processing". He organized a tutorial titled "Artificial Neural Networks for Document Analysis and Recognition", held at both ICDAR 2001 and ICPR 2002. He also gave a tutorial titled "Document Image Analysis and Recognition" at the European Conference on Digital Libraries (ECDL) 2002.


Related talk

An introductory talk (in Italian), given at EbookLab Italia 2011, the first Italian conference on Ebooks, is available to watch online.
The technical content of that talk was very limited compared with the ICDAR tutorial.


References

 

S. Marinai, E. Marino, G. Soda, Conversion of PDF books in ePub format, accepted at ICDAR 2011.

S. Marinai, E. Marino, G. Soda, Table of contents recognition for converting PDF documents in e-book formats, ACM Symposium on Document Engineering 2010, pp. 73-76.

S. Marinai, Metadata Extraction from PDF Papers for Digital Library Ingest, Proc. ICDAR 2009, IEEE, pp. 251-255.

S. Marinai, A survey of document image retrieval in digital libraries, 9th Colloque International Francophone sur l'Ecrit et le Document (CIFED 2006), pp. 193-198.

Open source tools and samples of ebooks will be distributed to tutorial participants together with handouts of the slides.

 

 

Address:

Simone Marinai, Dipartimento di Sistemi e Informatica, Università di Firenze, via S. Marta, 3, 50139 – Firenze, Italy. Phone +39 055 4796452. Fax: +39 055 4796363. Email: simone.marinai@unifi.it