Text File Format Identification: An Application of AI for the Curation of Digital Records

Santhilata Kuppili Venkata; Paul Young; Alex Green

Language

English

Description (en)

File format identification is a necessary step for the effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, file types. The National Archives has existing processes for dealing with binary file formats, using tools such as PRONOM and DROID. These methods rely on header information (metadata) and consistent binary sequences. However, they are not appropriate for identifying text file formats, as these do not contain recognisable header information or consistent patterns. Most text formats can be opened as plain text files; however, file type information is often needed to understand a file's use and context. Automated methods are necessary for text file format identification due to the scale of digital records processed by The National Archives, UK. An Artificial Intelligence methodology was tested and implemented using representative data collected from the GitHub repositories of UK Government departments. The first prototype has achieved reasonably good performance in detecting five file formats with similar characteristics. The results encourage us to carry out additional experiments to include further text file format types.

Keywords (en)

Text file formats, Supervised learning, digital preservation

Author of the digital object

Santhilata Kuppili Venkata (The National Archives)

Author of the digital object

Paul Young (The National Archives)

Author of the digital object

Alex Green (The National Archives (UK))

Format

application/pdf

Size

243.9 kB

Licence Selected

CC BY 4.0 International

Conferences

Conference 2021

Citable links

Persistent identifier
https://phaidra.univie.ac.at/o:1424885
Handle
https://hdl.handle.net/11353/10.1424885
Details

Uploader

iPRES Archiv

Object type

PDFDocument

Created

23.02.2022 12:26:42 UTC
This object is in collection

iPRES 2021 - Proceedings of 17th International Conference on Preservation

Open Access Collection

Openaire v3.0 collection