Title
PDF Mayhem: Is Broken Really Broken?
Subtitle (en)
iPres 2018 - Boston
Language
English
Description (en)
In this paper, we focus on the quality of PDF files. We are interested in the errors that validators report during the validation process: how accurate are these errors, and can we build easy workarounds to avoid or even fix these problems? We present our findings from a pilot experiment in which we validated more than 200,000 PDF files from well-known corpora with different validators and found several thousand problematic files. We then devised a process for reconstructing the invalid files and analyzing the converted data. Our results show that there are potentially working methods for avoiding problems during PDF validation, and that these methods can significantly reduce the workload of the preservation specialists who are responsible for the quality of the data. Our further aim is to master and manage PDF validation so that we can build an automated workflow that is able to migrate most PDF files to PDF/A files during ingest into a digital preservation repository. To achieve this in a reliable manner, we need further studies that build on what we have presented here.
Keywords (en)
iPres 2018, Boston
DOI
10.17605/OSF.IO/FZXC9
Author of the digital object
Juha Lehtonen
Author of the digital object
Heikki Helin
Author of the digital object
Johan Kylander
Author of the digital object
Kimmo Koivunen
Format
application/pdf
Size
277.7 kB
Licence Selected
CC BY 4.0 International
Conferences
Conference 2018
Details
Object type
PDFDocument
Created
05.01.2019 05:50:16