Title (en)
PDF Mayhem: Is Broken Really Broken?
Subtitle (en)
iPres 2018 - Boston
Language
English
Description (en)
In this paper, we focus on the quality of PDF files. We are interested in errors that validators report during the validation process: how accurate are these errors, and can we build easy workarounds to avoid or even fix these problems? We present our findings from a pilot experiment where we validated more than 200,000 PDF files from well-known corpora with different validators and found several thousand problematic files. We then devised a process of reconstructing the invalid files and analyzing the converted data. Our results show that there are potentially working methods for avoiding problems during PDF validation, and these methods can significantly reduce the workload for preservation specialists who are responsible for the quality of the data. Our further aim is to master and manage PDF validation so that we can build an automated workflow that is able to migrate most PDF files to PDF/A files during ingest into a digital preservation repository. To achieve this in a reliable manner, we need further studies to build on what we have presented here.
Keywords (en)
iPres 2018, Boston
DOI
10.17605/OSF.IO/FZXC9
Author of the digital object
Juha Lehtonen
Author of the digital object
Heikki Helin
Author of the digital object
Johan Kylander
Author of the digital object
Kimmo Koivunen
Format
application/pdf
Size
277.7 kB
Licence Selected
Conferences
Conference 2018
Citable links
Persistent identifier
https://phaidra.univie.ac.at/o:923651
Handle
https://hdl.handle.net/11353/10.923651
DOI
https://doi.org/10.17605/OSF.IO/FZXC9
Details
Object type
PDFDocument
Format
application/pdf
Created
05.01.2019 04:50:16 UTC
